# Draft Proposed RISC-V Composable Custom Extensions Specification

Version 0.92.231111, 2023-11-11: Draft

# Table of Contents

| Section | Title |
|---------|-------|
|         | Preface |
| 1.      | Introduction: a composable custom extension ecosystem |
| 1.1.    | Open, agile, interoperable instruction set innovation |
| 1.2.    | Examples |
| 1.3.    | Scope: reliable composition via strict isolation |
| 1.3.1.  | Stateless and stateful composable extensions |
| 1.4.    | Standard extensions and formats |
| 1.4.1.  | CXU Logic Interface (CXU-LI) |
| 1.4.2.  | -Zicx: composable extensions extension |
| 1.4.3.  | Composable extension multiplexing |
| 1.4.4.  | IStateContext and serializable stateful composable extensions |
| 1.4.5.  | CX Application Programming Interface and CX-ABI |
| 1.5.    | System composition |
| 1.5.1.  | Metadata and system manifest |
| 1.5.2.  | Composer |
| 1.5.3.  | Diversity of systems and operating systems |
| 1.6.    | Versioning |
| 1.7.    | Pushing the envelope |
| 1.8.    | Future directions, TODOs |
| 1.9.    | Acknowledgements |
| 2.      | Composable extensions: the hardware-software interface |
| 2.1.    | Definitions |
| 2.2.    | New CX control / status registers |
| 2.2.1.  | mcx_selector CSR 0xBC0: select active CXU and state context |
| 2.2.2.  | cx_status CSR 0x801: CX status |
| 2.2.3.  | mcx_table CSR 0xBC1: CX selector table base |
| 2.2.4.  | cx_index CSR 0x800: CX selector index |
| 2.2.5.  | Implicit CXU CSR fences |
| 2.3.    | Custom function instruction encodings |
| 2.3.1.  | Custom-0 R-type encoding |
| 2.3.2.  | Custom-1 I-type encoding |
| 2.3.3.  | Custom-2 flex-type encoding |
| 2.4.    | Custom function instruction execution via composable extension multiplexing |
| 2.4.1.  | Precise exceptions |
| 2.5.    | IStateContext: the standard custom functions |
| 2.5.1.  | Interface state context status word |
| 2.5.2.  | cx_read_status standard custom function instruction |
| 2.5.3.  | cx_write_status standard custom function instruction |
| 2.5.4.  | cx_read_state standard custom function instruction |
| 2.5.5.  | cx_write_state standard custom function instruction |
| 2.6.    | Resource management and context switching |
| 2.7.    | CX access control |
| 3.      | Composable Extension Unit Logic Interface |
| 3.1.    | Definitions |
| 3.2.    | Example configured system |
| 3.3.    | CXU-LI feature levels |
| 3.3.1.  | CXU-L0: combinational CXU |
| 3.3.2.  | CXU-L1: fixed latency CXU |
| 3.3.3.  | CXU-L2: variable latency CXU |
| 3.3.4.  | CXU-L3: reordering CXU |
| 3.3.5.  | Feature levels summary |
| 3.4.    | CXU-LI signaling |
| 3.4.1.  | CXU-LI configuration parameters |
| 3.4.2.  | Clock, reset, clock enable |
| 3.4.3.  | Request and response valid-ready flow control |
| 3.4.4.  | Response status / error checking |
| 3.4.5.  | Raw instruction |
| 3.4.6.  | Request-response ID |
| 3.5.    | CXU-L0 combinational CXU signaling |
| 3.5.1.  | CXU-L0 configuration parameters |
| 3.5.2.  | CXU-L0 signals |
| 3.5.3.  | CXU-L0 signaling protocol |
| 3.5.4.  | CXU-L0 example |
| 3.6.    | CXU-L1 fixed latency CXU signaling |
| 3.6.1.  | CXU-L1 configuration parameters |
| 3.6.2.  | CXU-L1 signals |
| 3.6.3.  | CXU-L1 signaling protocol |
| 3.6.4.  | CXU-L1 example |
| 3.7.    | CXU-L2 variable latency CXU signaling |
| 3.7.1.  | CXU-L2 configuration parameters |
| 3.7.2.  | CXU-L2 signals |
| 3.7.3.  | CXU-L2 signaling protocol |
| 3.7.4.  | CXU-L2 example |
| 3.8.    | CXU-L3 reordering CXU signaling |
| 3.8.1.  | CXU-L3 configuration parameters |
| 3.8.2.  | CXU-L3 signals |
| 3.8.3.  | CXU-L3 signaling protocol |
| 3.8.4.  | CXU-L3 example |
| 3.9.    | CXU feature level adapters |
| 3.9.1.  | Cvt01: raise CXU-L0 to CXU-L1 |
| 3.9.2.  | Cvt02: raise CXU-L0 to CXU-L2 |
| 3.9.3.  | Cvt12: raise CXU-L1 to CXU-L2 |
| 3.10.   | CXU-LI-compliant CPUs |
| 3.10.1. | CPUs and CXU-LI feature levels |
| 3.11.   | Example: CXU signaling in a composed system |
| 3.12.   | Composing CXUs with AXI4-Streams |
| 4.      | CXU Metadata (CXU-MD) |
| 4.1.    | CXU Metadata |
| 4.2.    | Example CXU metadata |
| 4.3.    | CPU Metadata |
| 4.4.    | Example CPU metadata |
| 4.5.    | System manifest |
| 5.      | TODO |
| 5.1.    | Open design problems (post 1.0) |
| 5.2.    | Cost model |
| 6.      | Specification Change History |
| 6.1.    | Version 0.92.231111, 2023-11-11: Add extension multiplexing *type*. |
| 6.2.    | Version 0.91.230803, 2023-08-03: Simplify and improve terminology. |
| 6.3.    | Version 0.90.220327, 2022-03-27: First complete draft. |
|         | References |

# Preface

This document comprises draft proposed specifications for hardware-software and hardware-hardware interfaces, formats, and metadata, enabling independent, efficient, and robust composition of diverse composable custom instruction set extensions, composable extension units hardware, and composable extension libraries software.

It is a work in progress. We request your feedback.

At present this is not a work product of a RISC-V International SIG, Task Group, or subcommittee. Rather we share this work in the hope that it may motivate and inform two hypothetical RISC-V International Task Groups: 1) ISA: CX-ISA TG: Composable Extensions (-Zicx); and 2) non-ISA: CXU-LI TG: Composable Extension Unit Logic Interface.

(Pending standardization, implementers might elect to implement the present specifications as their own *custom extension*.)

This work summarizes years of ongoing discussions and prototyping by (alphabetical order): Tim Ansell, Tim Callahan, Jan Gray, Karol Gugala, Olof Kindgren, Maciej Kurc, Guy Lemieux, Charles Papon, Zdenek Prikryl, Tim Vogt.

Copyright © 2019-2023, Jan Gray <jan@fpga.org>

Copyright © 2019-2023, Tim Vogt

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at [www.apache.org/licenses/LICENSE-2.0.html](https://www.apache.org/licenses/LICENSE-2.0.html).

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This work incorporates design elements from the RISC-V documentation template github.com/riscv/docs-dev-guide which uses a Creative Commons Attribution 4.0 International ("CC BY 4.0") license. It is built using the asciidoc docker tools image riscvintl/rv-docs.

RISC-V is a registered trade mark of RISC-V International.

# 1. Introduction: a composable custom extension ecosystem



Tip blocks signify non-normative commentary. This Introduction is non-normative. Sections titled Example are non-normative.



Note blocks signify review comments: open issues, suggested improvements.

SoC designs employ application-specific hardware accelerators to improve performance and reduce energy use — particularly so with FPGA SoCs that offer both plasticity and abundant spatial parallelism. The RISC-V instruction set architecture (ISA) anticipates this and invites domain-specific custom instructions within the base ISA (Waterman & Asanović, 2019, p. 5).

There are many RISC-V processors with custom instruction extensions, and now some vendor tooling for creating them. But the software libraries that use these extensions and the cores that implement them are authored by different organizations, using different tools, and might not work together side-by-side in a new system. Different composable extensions may conflict in use of opcodes, or their implementations may require different CPU cores, pipeline structures, logic extensions, models of computation, means of discovery, or error reporting regimes. Composition is difficult, impairing reuse of hardware and software, and fragmenting the RISC-V ecosystem.

The RISC-V Composable Custom Extensions Specification introduces a set of hardware-hardware and hardware-software extensions and metadata designed to make it easy to create, compose, reuse, version, program, and deploy systems with multiple composable extensions and their libraries, enabling an open ecosystem, and marketplace, of composable custom extensions' hardware and software.

# 1.1. Open, agile, interoperable instruction set innovation

RISC-V International uses a community process to define a new optional standard extension to the RISC-V instruction set architecture. Candidate extensions must be of broad interest and general utility to justify the permanent allocation of precious RISC-V opcode space, CSR space, and more generally to add to the enduring, essential complexity of the RISC-V platform. New standard extensions typically require months or years to reach consensus and ratification.

In contrast, the extensions defined in this specification allow anyone, whether individual, organization, or consortium, to rapidly define, develop, and use:

- a *composable extension (CX)*: a composable custom extension consisting of a set of *custom function (CF)* instructions;
- a composable extension unit (CXU): a composable hardware core that implements a composable extension;
- an accelerated *composable extension library* that issues custom functions of composable extensions;
- a processor that can use any CXU; and
- tools to create or consume these elements;

and to compose these arbitrarily into a system of hardware accelerated software libraries.

There need be *no central authority*, no lock in, no lock out, and no asking for permission. Composable extensions, their CXUs and libraries, may be open or proprietary, of broad or narrow interest. A new processor can use existing CXUs and CX libraries. A new composable extension, CXU, and library can be used by existing CPUs and systems. Many CXUs may implement a given composable extension, and many libraries may use a composable extension.

Such open composition requires routine, robust integration of separately authored, separately versioned elements into stable systems that *just work* so that if the various hardware and software elements correctly work separately, they correctly work together, and so that if a composed system works correctly today, it continues to work, even as extensions and implementations evolve across years and decades.

Composition also requires an unlimited number of independently developed composable extensions to coexist within a fixed ABI and ISA. This is achieved with *composable extension multiplexing*, described below.

# 1.2. Examples

Alice develops a multicore RISC-V-based FPGA SmartNIC application processor subsystem. The software stack includes processes that already use a cryptography CX library that issues custom instructions, of a cryptography composable extension, that execute on a cryptography composable extension unit.

Profiling reveals a compute bottleneck in file block data compression. Fortunately, the compression library can use a hardware-accelerated compression composable extension, if present in the system. Alice obtains a compression CXU package that implements the extension, adds it to the MPSoC system manifest, configures its parameter settings, then re-composes and rebuilds the FPGA design. The cryptography CXU, compression CXU, CXU interconnect, and CPU cores all use the same CXU Logic Interface, so this incurs no RTL coding. The system CXU map (a new part of the device tree) is updated to map from the compression composable extension ID (CX\_ID) (a 128-bit GUID) to the compression unit CXU\_ID.

The compression library calls the CX Runtime to discover if compression acceleration is available. The runtime consults the CXU map for that CX\_ID, finding the compression CXU\_ID. Next the library uses the CX Runtime to select the compression extension, and its CXU, prior to issuing compression instructions to this CXU. Later the cryptography library uses the same CX Runtime API to discover and select the cryptography extension prior to issuing cryptography instructions to the cryptography CXU.
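The discovery step above can be sketched as a lookup in the CXU map. This is a minimal illustration, not the CX Runtime API: the names `CxId`, `CxuId`, `CxuMap`, and `cxu_map_example`, the 16-byte GUID representation, and the GUID values are all assumptions made for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

// A CX_ID is a 128-bit GUID; modeling it as 16 raw bytes is an assumption.
using CxId = std::array<std::uint8_t, 16>;
// Hypothetical CXU_ID type; the actual width is system specific.
using CxuId = std::uint32_t;

// Hypothetical in-memory form of the system CXU map (CX_ID -> CXU_ID).
class CxuMap {
public:
    void add(const CxId& cx_id, CxuId cxu_id) { map_[cx_id] = cxu_id; }

    // Discovery: pointer to the CXU_ID that implements cx_id, or nullptr
    // if that extension is not present in this system.
    const CxuId* find(const CxId& cx_id) const {
        const auto it = map_.find(cx_id);
        return it == map_.end() ? nullptr : &it->second;
    }

private:
    std::map<CxId, CxuId> map_;
};

// Usage sketch: Alice's system after the compression CXU is added.
inline void cxu_map_example() {
    const CxId crypto_cx{0x01};    // made-up GUIDs for illustration
    const CxId compress_cx{0x02};
    const CxId bnn_cx{0x03};       // not present in Alice's system

    CxuMap map;
    map.add(crypto_cx, 0);
    map.add(compress_cx, 1);

    const CxuId* found = map.find(compress_cx);
    assert(found != nullptr && *found == 1u);
    assert(map.find(bnn_cx) == nullptr);  // library falls back to software
}
```

A library that gets `nullptr` back would simply take its unaccelerated code path, which is how the compression library can run on systems with or without the compression CXU.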



Figure 1. Bob's system, composed from CPU and CXU packages and composable extension libraries

Later, Bob takes Alice's system design, replaces the CPU cores with different (but also CXU-compatible) cores, and adds an ML inference library. For further acceleration, Bob defines a new binary neural network inference composable extension, IBNN, identified with a new CX\_ID he mints. Bob's new BNN custom instructions reuse the standard custom instruction encodings, which is fine because they're scoped to IBNN. Bob develops a bobs\_bnn\_cxu core, and CXU metadata that describes it. He adds that package to the system manifest and rebuilds the system, updating the CXU map. Bob's system now runs highly accelerated with cryptography, compression, and inference custom function instructions issuing from the various CPU cores and executing in the various CXUs.

Figure 1 illustrates this. A *Composer* tool assembles and configures the reusable, composable CPU and CXU RTL packages into a complete system, per the system manifest, and generates a devicetree (or similar) that determines the system CXU map. Each accelerated library uses the CX Runtime to select its respective composable extension, and its CXU, prior to issuing custom function instructions of that extension to that CXU.

# 1.3. Scope: reliable composition via strict isolation

To ensure that composition of composable extensions and their CXUs does not subtly change the behavior of any extension, each must operate in isolation. Therefore, each custom function (CF) instruction is of limited scope: exclusively computing an ALU-like integer function of up to two operands (integer register(s) and/or immediate value), with read/write access to the extension's private state (if any), writing the result to a destination register.

A CF may not access other resources, such as floating-point registers or vector registers, pending definition of suitable custom instruction formats.

A CF may not access *isolation-problematic* shared resources such as memory, CSRs, the program counter, the instruction stream, exceptions, or interrupts, pending a means to ensure correct composition by design. (Except that, as with RISC-V floating point extensions, the default error model accumulates CXU errors in a shared CXU status CSR.)



The isolated state of a composable extension can include private registers and private memories.

### 1.3.1. Stateless and stateful composable extensions

A composable extension may be stateless or stateful. For a stateless extension, each CF is a pure function of its operands, whereas a stateful extension has one or more isolated state contexts, and each CF may access, and as a side effect, update, the hart's *current* state context of the extension (only).

Isolated state means that, latency notwithstanding, 1) the behavior of the extension only depends upon the series of CF requests issued on that extension and never upon any other operation of the system; and 2) besides updating extension state, the CXU status CSR, and a destination register, issuing a CF has no effect upon any other architected state or behavior of the system. Issuing a CF instruction may update the current state context of the composable extension but has no effect upon another state context of that extension, nor that of any other extension.

A CXU implementing a stateful composable extension is typically provisioned with one state context per hart, but other configurations, including one context per request, activity, fiber, task, or thread, or a small pool of shared contexts, or several harts sharing one context, or one singleton context, are also possible. Similarly, each CXU in a system may be configured with a different number of its state contexts.

A serializable stateful composable extension supports extension-agnostic context management.



Although composable extensions never introduce nor use CSRs, the same effect can be obtained via custom functions that read or write facets of the extension state context.

# 1.4. Standard extensions and formats

To facilitate an open ecosystem of composable extensions, CXUs, libraries, and tools, the specification defines common interop extensions and formats:

- the CXU Logic Interface (CXU-LI),
- the Composable Extension Hardware-Software Interface (CX-ABI), including CXU-extensions to RV-I (-Zicx),
- the Composable Extension Runtime API (CX-RT), and
- build-time CXU Metadata (CXU-MD).



Figure 2. Hardware-software extensions stack. New standard extensions and formats are shaded.

The hardware-software extensions stack (Figure 2) shows how these extensions and formats work together to compose user-defined composable extensions $CX_0$ and $CX_1$, their libraries, and their CXUs into a system.

### 1.4.1. CXU Logic Interface (CXU-LI)

The CXU-LI defines the hardware-to-hardware logic interface between a *CXU requester* (e.g., a CPU) and a *CXU responder* (e.g., a CXU). When a custom function instruction issues, the CPU sends a *CXU request*, providing the request's *CXU identifier* (*CXU\_ID*), the *custom function identifier* (*CF\_ID*), the *state index* (*STATE\_ID*), if any, and request data (operands). The CXU performs the custom function then sends a *CXU response* providing response data and error status.

In a system with multiple CPUs and/or CXUs, switch and adapter CXUs accept and route requests to CXUs and accept and route responses back to CPUs. The CXU-LI supports CPUs and CXUs of various *feature levels* of capability and complexity, including combinational CXUs, fixed-latency CXUs, and variable latency CXUs with flow control.
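The request/response exchange and switch routing described above can be modeled in software as follows. This is only a behavioral sketch under stated assumptions: the field widths, the status encoding (0 means OK), and the two example CXUs (`arith_cxu`, `popcount_cxu`) with their custom functions are invented for illustration and are not the CXU-LI signal definitions.

```cpp
#include <cassert>
#include <cstdint>

// Field names follow the text; the 32-bit widths are assumptions.
struct CxuRequest {
    std::uint32_t cxu_id;    // CXU_ID: which CXU should respond
    std::uint32_t cf_id;     // CF_ID: which custom function to perform
    std::uint32_t state_id;  // STATE_ID: which state context (unused if stateless)
    std::uint32_t data0;     // first operand
    std::uint32_t data1;     // second operand
};

struct CxuResponse {
    std::uint32_t data;    // result written to the destination register
    std::uint32_t status;  // 0 = OK, nonzero = error (hypothetical encoding)
};

// Hypothetical stateless CXU with two made-up custom functions.
inline CxuResponse arith_cxu(const CxuRequest& req) {
    switch (req.cf_id) {
    case 0: return {req.data0 + req.data1, 0};  // CF 0: add
    case 1: return {req.data0 ^ req.data1, 0};  // CF 1: xor
    default: return {0, 1};                     // unknown CF_ID
    }
}

// Hypothetical second CXU: CF 0 counts the set bits in data0.
inline CxuResponse popcount_cxu(const CxuRequest& req) {
    if (req.cf_id != 0) return {0, 1};
    std::uint32_t n = 0;
    for (std::uint32_t v = req.data0; v != 0; v &= v - 1) ++n;
    return {n, 0};
}

// Model of a switch CXU: routes a request by CXU_ID, returns the response.
inline CxuResponse switch_cxu(const CxuRequest& req) {
    switch (req.cxu_id) {
    case 0: return arith_cxu(req);
    case 1: return popcount_cxu(req);
    default: return {0, 1};  // no such CXU
    }
}

// Usage sketch: CXU 0, CF 0 (add), operands 2 and 3.
inline void cxu_request_example() {
    const CxuRequest req{0, 0, 0, 2, 3};
    const CxuResponse resp = switch_cxu(req);
    assert(resp.status == 0u && resp.data == 5u);
}
```

Note that the same CF\_ID value (0) means "add" on one CXU and "popcount" on the other; the CXU\_ID in the request is what disambiguates them.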

### 1.4.2. -Zicx: composable extensions extension

The -Zicx "composable extensions" extension repurposes three custom function instruction formats and adds four CSRs to provide access-controlled composable extension multiplexing and error signaling. The three instruction formats reuse the `custom-0`, `custom-1`, and `custom-2` formats / major opcodes (Waterman & Asanović, 2019, p. 143) but (via CX multiplexing) compose correctly with any preexisting vendor-defined CPU-specific composable extensions and their custom instructions. The four new CXU CSRs are:

- mcx\_selector: selects the hart's current CXU\_ID and STATE\_ID, for composable extension multiplexing;
- cx\_status: accumulates CXU errors;
- mcx\_table, cx\_index: efficient access control to CXUs and CXU state.



A machine mode mcx\_table CSR is probably insufficient given various M/H/S/U privilege levels. This requires additional design work and additional CSRs.

### 1.4.3. Composable extension multiplexing

Composable extension multiplexing provides an inexhaustible collision-free opcode space for CF instructions for diverse composable extensions without resort to any *central assigned opcodes authority*, and thereby facilitates direct reuse of CX library binaries.

A custom-extension-aware library, prior to issuing a CF instruction, must first CSR-write a *system and hart specific* CX selector value to mcx\_selector, routing subsequently issued CF instructions on this hart to its CXU and to a specific state context. Like the -V vector extension's vsetvl instructions, a CSR-write to mcx\_selector is a prefix that modifies the behavior of CF instructions that follow. With each CF instruction issued, the CPU sends a CXU request to the hart's current CXU and its current state. This request is routed by standard switch and adapter CXUs to the hart's *current* CXU, which performs the custom function using the hart's current state context. Its response is routed back to the CPU which writes the destination register and updates cx\_status.

The mcx\_selector CX selector value, a tuple (CXU\_ID, STATE\_ID), is system specific because different systems may be configured with different sets of CXUs, with different CXU\_ID mappings, and is hart specific because different harts may use different isolated state contexts. Raw CX selector values are not typically compiled into software binaries.

In a system with multiple CX libraries that invoke CF instructions on different extensions, each library uses the CX Runtime to look up selectors for a CX\_ID and update mcx\_selector, routing CF instructions to its extension's CXU and state context. Over time, across library calls, mcx\_selector is written again and again.



Reuse of custom instruction encodings across extensions will make debugging, especially disassembly, more challenging.

### 1.4.4. IStateContext and serializable stateful composable extensions

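The specification defines a composable extension IStateContext with four standard custom functions for serializable stateful composable extensions:

- cx\_read\_status: read the state context status;
- cx\_write\_status: write the state context status;
- cx\_read\_state: read a word of the state context;
- cx\_write\_state: write a word of the state context.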

The CXU status indicates cumulative error flags, clean/dirty, and state context size. The read/write state functions access words of the state context.

These standard custom functions enable an extension-aware CX library to access stateful extension specific error status, and an extension-agnostic runtime or operating system to reset, save, and reload state context(s).
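As a rough software model of extension-agnostic save and reload, consider the sketch below. Everything here is an assumption made for illustration: the C++ signatures of the four functions, the `state_words()` accessor (standing in for the size field that the status is said to carry), and the `AccumulatorContext` example extension. In a real system these are CF instructions, not virtual function calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model of one state context. The four cx_* functions mirror the
// IStateContext custom functions; their signatures are assumptions.
struct StateContext {
    virtual ~StateContext() = default;
    virtual std::uint32_t cx_read_status() const = 0;
    virtual void cx_write_status(std::uint32_t status) = 0;
    virtual std::uint32_t cx_read_state(std::size_t index) const = 0;
    virtual void cx_write_state(std::size_t index, std::uint32_t word) = 0;
    // Hypothetical helper: stands in for the state size reported in status.
    virtual std::size_t state_words() const = 0;
};

// Made-up stateful extension: a running sum held in one state word.
class AccumulatorContext : public StateContext {
public:
    // Extension-specific custom function: add x, return the new sum.
    std::uint32_t accumulate(std::uint32_t x) { return sum_ += x; }

    std::uint32_t cx_read_status() const override { return status_; }
    void cx_write_status(std::uint32_t status) override { status_ = status; }
    std::uint32_t cx_read_state(std::size_t) const override { return sum_; }
    void cx_write_state(std::size_t, std::uint32_t word) override { sum_ = word; }
    std::size_t state_words() const override { return 1; }

private:
    std::uint32_t status_ = 0;
    std::uint32_t sum_ = 0;
};

// Extension-agnostic save: status first, then every state word.
inline std::vector<std::uint32_t> save_context(const StateContext& cx) {
    std::vector<std::uint32_t> words;
    words.push_back(cx.cx_read_status());
    for (std::size_t i = 0; i < cx.state_words(); ++i)
        words.push_back(cx.cx_read_state(i));
    return words;
}

// Extension-agnostic reload of a context saved by save_context().
inline void reload_context(StateContext& cx,
                           const std::vector<std::uint32_t>& words) {
    assert(words.size() == cx.state_words() + 1);
    cx.cx_write_status(words[0]);
    for (std::size_t i = 0; i < cx.state_words(); ++i)
        cx.cx_write_state(i, words[i + 1]);
}
```

The point of the sketch is that `save_context` and `reload_context` never need to know what the accumulator's state word means; they only use the four standard functions.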

### 1.4.5. CX Application Programming Interface and CX-ABI

The CX-API consists of the *CX Runtime* API, and a calling convention rule. Both are necessary for correct discovery, operation, and composition of CX libraries. As described above (1.4.2) the current mcx\_selector CSR selects the current composable extension/CXU and state context for the hart. However, a CX library should not directly create a CX selector value, nor directly access the CSR. Rather a CX library uses the CX Runtime to look up the CX selector value for its composable extension's CX\_ID and to write it to mcx\_selector, prior to issuing CF instructions. For example, using a C++ *RAII* object cx to represent a (scoped) composable extension selection:
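A minimal sketch of such a scoped selection object follows. It is a simulation rather than the CX Runtime itself: `mcx_selector()` is a thread-local stand-in for the CSR, `cx_lookup_selector()` and its selector values are hypothetical, and the small `CxId` enum stands in for 128-bit CX\_IDs. Restoring the previous selector in the destructor is chosen to match the callee-save convention described next.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the hart's mcx_selector CSR. Real code would reach the CSR
// through the CX Runtime; a thread-local variable lets the sketch run
// on any host.
inline std::uint32_t& mcx_selector() {
    thread_local std::uint32_t value = 0;
    return value;
}

// Stand-ins for 128-bit CX_IDs and for the CX Runtime selector lookup.
// The IDs and the selector values are made up for illustration.
enum CxId { CX_ID_IA, CX_ID_IB };

inline std::uint32_t cx_lookup_selector(CxId id) {
    return id == CX_ID_IA ? 0x1001u : 0x2002u;
}

// Scoped selection: select the extension on construction and restore the
// previously selected extension on destruction.
class use_cx {
public:
    explicit use_cx(CxId id) : saved_(mcx_selector()) {
        mcx_selector() = cx_lookup_selector(id);
    }
    ~use_cx() { mcx_selector() = saved_; }

    use_cx(const use_cx&) = delete;
    use_cx& operator=(const use_cx&) = delete;

private:
    std::uint32_t saved_;  // selector in effect before this selection
};

// Usage sketch.
inline void use_cx_example() {
    {
        use_cx cx(CX_ID_IA);
        // CF instructions of extension IA would issue here.
        assert(mcx_selector() == 0x1001u);
    }
    // Leaving the scope restored the previous selection.
    assert(mcx_selector() == 0u);
}
```

Because the restore happens in a destructor, it also runs when the scope is left by an exception.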

The provisional CX-ABI defines a *callee-save* calling convention for mcx\_selector. For example, consider CX library functions a() and b(), for extensions IA and IB, that issue CF instructions af0, af1, bf0, bf1, in this program:

```
main() { a(); }
a() { use_cx a_cx(CX_ID_IA); af0; b(); @1 af1; }
b() { use_cx b_cx(CX_ID_IB); bf0; bf1; }
```

with execution trace:

```
main() { a() { a_cx(); af0; b() { b_cx(); bf0; bf1; ~b_cx(); } @1 af1; ~a_cx(); } }
```

With a callee-save discipline, at point @1, upon return from b(), the current composable extension must be IA again. Thus the b\_cx() constructor saves a()'s mcx\_selector value while overwriting it; later its ~b\_cx() destructor restores it. This *RAII* approach also correctly restores mcx\_selector in the event of an exception handling stack unwind.

# 1.5. System composition

### 1.5.1. Metadata and system manifest

To support automatic composition of CPUs and CXUs into working systems, this specification defines a standard CXU metadata format that details each core's properties, features, and configurable parameters, including CXU-LI feature level, data widths, response latency (or variable), and number of state contexts. Each CPU and CXU package, as well as the system manifest, includes a metadata file.

### 1.5.2. Composer

A system composer (human or tool) gathers the system manifest metadata and the metadata of the manifest-specified CPUs and CXUs, then uses (manual or automatic) constraint satisfaction to find feasible, optimal parameter settings across these components. The composer may also configure or generate switch and adapter CXUs to automatically interconnect the CPU and the CXUs.

For example, a system composed from a CPU that supports two or three cycle fixed latency CXUs, a $CXU_1$ that supports response latency of one or more cycles, a $CXU_2$ that has a fixed response latency of three cycles, and a $CXU_3$ which is combinational (zero cycles latency), overall has a valid configuration with three cycles of CXU latency, with the CPU coupled to a switch CXU, coupled to $CXU_1$ and $CXU_2$ and to a fixed latency adapter CXU, coupled to $CXU_3$.
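The latency part of this example can be worked as a tiny constraint problem: intersect the latency ranges each component accepts. The sketch below is illustrative only. `LatencyRange`, `common_latency`, and `composer_example` are hypothetical names, picking the smallest feasible latency as "optimal" is an assumption, and the adapter in front of $CXU_3$ is assumed able to pad a combinational CXU to any fixed latency.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Inclusive range of response latencies (in cycles) a component accepts.
struct LatencyRange {
    int min_cycles;
    int max_cycles;
};

const int kUnbounded = std::numeric_limits<int>::max();

// Intersects all ranges and reports the smallest common latency.
// Returns false (leaving latency untouched) if no common latency exists.
inline bool common_latency(const std::vector<LatencyRange>& parts,
                           int& latency) {
    int lo = 0;
    int hi = kUnbounded;
    for (const LatencyRange& p : parts) {
        lo = std::max(lo, p.min_cycles);
        hi = std::min(hi, p.max_cycles);
    }
    if (lo > hi) return false;
    latency = lo;
    return true;
}

// Usage sketch: the four components from the example in the text.
inline void composer_example() {
    const std::vector<LatencyRange> parts = {
        {2, 3},           // CPU: two or three cycle fixed latency CXUs
        {1, kUnbounded},  // CXU_1: one or more cycles
        {3, 3},           // CXU_2: fixed three cycles
        {0, kUnbounded},  // CXU_3: combinational, behind a latency adapter
    };
    int latency = 0;
    const bool feasible = common_latency(parts, latency);
    assert(feasible && latency == 3);
}
```

A real composer would solve over more parameters than latency (data widths, feature levels, state context counts), but the same intersect-then-choose pattern applies.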

### 1.5.3. Diversity of systems and operating systems

Composable extensions and CXUs are designed for use across a broad spectrum of RISC-V systems, from a simple RVI20U-Zicsr-Zicx microcontroller running bare metal fully trusted firmware, to a multicore RVA20S Linux profile system running secure multi-programmed, multithreaded user processes that use various CX libraries, with privileged hypervisors and operating systems securely managing access control to CXUs and CXU state.

# 1.6. Versioning

Interoperation specifications live for decades. Meanwhile "the only constant is change". This specification anticipates various axes of versioning.

- Specification versioning. This specification and its requirements will evolve. The extensions and formats it specifies will evolve. This includes the CXU Logic Interface, for example.
- CXU-LI versioning. The CXU hardware-hardware interface spec will evolve, with new signals, behaviors, constraints, and metadata.
- Composable extension versioning. Any user-defined composable extension may evolve, changing or adding custom functions, changing behaviors, semantics.
- Component implementation versioning. Without changing the extensions it implements, the implementation of a component such as a CXU, CPU, or a CX library may change for a bug fix, a performance enhancement, or any other reason.

How are these anticipated and addressed?

CXU-LI versioning: A CXU module configuration parameter CXU\_LI\_VERSION indicates to the CXU the version of the CXU-LI signals and semantics in effect.

Versioning of the extension multiplexing mechanism: The mcx\_selector.type field determines the current extension multiplexing type. It provides backwards compatibility with legacy custom instructions (i.e., multiplexing off) and forwards compatibility with future extension multiplexing schemes, anticipating future layouts and interpretations of other selector fields and future means of decoding custom-[0123] instructions into CXU requests.

Composable extension versioning: A composable extension is immutable. To change or add any custom functions or their behaviors, a new composable extension must be minted. (Consider the many AVX vector extension variants that have been introduced over many years.) With Microsoft COM software components, an interface IFoo might evolve to become IFoo2. The original IFoo remains and IFoo clients are unaffected. Every component implements IUnknown::QueryInterface(), which lets a client determine whether the component implements a given interface. A component might implement both interfaces, giving its client a choice.

Similarly a CXU might implement two composable extensions, e.g. IPosit, and IPosit2, an enhanced version of IPosit introduced later. In that case, the CXU will have two CXU IDs, CXU\_CXU\_ID\_MAX=2, one for each extension it implements, each present in the CXU Map, from CX\_ID\_IPosit to the first CXU ID and CX\_ID\_IPosit2 to the second. Thus each CX software library present can access the extension, functions, and behavior it depends upon, even if only one CXU module implements both behaviors.

Note how composable extension multiplexing facilitates extension versioning: a new version of an extension (i.e., a new extension) may be introduced at no cost to any existing or future extension.

Implementation versioning: This does not change the interface to a component (e.g., for a CXU, its CXU-LI and the composable extension it implements). At system composition time it may be necessary to specify implementation version requirements, perhaps in metadata, but these should not be visible to, computed upon by, nor depended upon by the HW-HW-SW interfaces.



TODO: Add examples of Alice and Bob's travails with their composed SoC designs, over time.

All version numbering uses semantic versioning (semver.org).

# 1.7. Pushing the envelope

The hardware-hardware and hardware-software interfaces proposed in this draft specification are a foundational step, necessary but insufficient to fully achieve the modular, automatically interoperable extension ecosystem we envision.

A complete solution probably entails much new work, for example in runtime libraries, language support, tools (binary tools, debuggers, profilers, instrumentation), emulators, resource managers including operating systems and hypervisors, and tests and test infrastructure including formal systems to specify and validate composable extensions and their CXU implementations.

Whether or not the specific abstractions and interoperation extensions proposed herein are adopted, we believe this specification motivates composable extension composition, and illustrates *one approach* for such composition scenarios using RISC-V, in sufficient detail to understand how the moving pieces achieve a workable composition system, and to spotlight some of the issues that arise.

# 1.8. Future directions, TODOs

The present specification focuses on composition at the hardware-software extension, and below. Future work includes:

- Expand the scope of composable extensions to include access to non-integer registers, CSRs, and memory, while
  preserving composition.
- Expand the CXU Logic Interface to support greater computation flexibility and speculative execution.
- Design and implement an automatic system composition tool.

# 1.9. Acknowledgements

Composable Extensions are inspired by the Interface system of the Microsoft Component Object Model (COM), a ubiquitous architecture for robust arms-length composition of independently authored, independently versioned software components, at scale, over decades (Microsoft, 2020).



(End of non-normative Introduction section.)

# 2. Composable extensions: the hardware-software interface

The Composable Extension abstraction bridges software and hardware, enabling diverse software libraries which target the same extension and diverse hardware CXU cores which implement the same extension. Then *composable* extension multiplexing enables composition of systems of separately authored and versioned components.

### 2.1. Definitions

A **custom function (CF)** is a function from two integer operands to an integer result and response status. May be stateless or stateful.

A custom function identifier (CF\_ID) is an integer, in the scope of a composable extension, identifying a custom function. A valid CF\_ID is a value that identifies a CF instruction implemented by a configured extension.

A **stateless custom function** is a CF that is a pure function of its operands (only). Never reads nor writes any other architected state. Given the same operand values, always produces the same result and response status.

A stateful custom function is a CF that is a function of its operands and its composable extension state context (only). May read and write the context but never reads or writes other architected state. Equivalently: a CF that is a function of its operands and of any prior CF invocations upon its composable extension (only).

A composable extension (CX, extension) is a fixed named set of custom functions. May be stateless or stateful. *Fixed:* immutable, i.e., any versioning of the CFs or the behavior of an extension necessarily defines a new extension. *Named:* has a composable extension identifier.

A composable extension identifier (CX\_ID) is a 128-bit globally unique ID (GUID) [see RFC-4122], unique in history, identifying a composable extension.

A stateless composable extension is a fixed named set of stateless custom functions.

A **stateful composable extension** is a fixed named set of custom functions, at least one of which is a stateful custom function, plus a composable extension state context.

A composable extension state context (state context, state, context) is an isolated collection of state associated with a stateful composable extension. Isolated: stateful custom functions of the extension may read and write the state context, but no other element or operation of the system may read or write the state context.

IStateContext is a stateful composable extension, identified as CX\_ID\_IStateContext, and with four stateful custom functions: {cf\_read\_status, cf\_write\_status, cf\_read\_state, cf\_write\_state}, providing a standard way to manage a composable extension state context. A serializable composable extension is a stateful composable extension that inherits IStateContext.

A configured composable extension (configured extension) is an extension that is configured (included) within a system and is implemented by a CXU of the system (a configured CXU). Within a system, a configured extension has some configured number of state contexts.

A **configured extension subset** is a configured extension in which one or more custom functions of the extension are not implemented. The CF\_IDs of unimplemented custom functions are invalid.

A composable extension state context identifier (STATE\_ID) is an integer index, in the scope of a configured extension, in the range [0, number of state contexts - 1], identifying one of an extension's contexts in the system. A stateless extension has zero state contexts and uses STATE\_ID=0 whenever a STATE\_ID is required. A **valid STATE\_ID** is a value that identifies a state context of a configured extension.

A custom function instruction (CF instruction) is a RISC-V custom instruction that executes a custom function using a composable extension unit, sourcing the integer operands from the register file and/or from an immediate field of the instruction, writing the integer result to the register file, and updating the CX status CSR with the response status.

A composable extension unit (CXU) is a core that implements one or more composable extensions. A stateful CXU implements at least one stateful composable extension.

A CXU\_ID is an integer, in the scope of a system, that identifies a configured extension implemented by a CXU. When one CXU implements multiple configured extensions, different CXU\_IDs identify the configured extensions. A valid CXU\_ID is a CXU\_ID value that identifies a configured extension.

A composable extension selector (CX selector, selector) is a 32-bit value written to the mcx\_selector CSR to select the hart's current extension multiplexing type (e.g., off, type-1, ...) and to specify the hart's current configured extension / CXU and current state context.

A CX selector table is a 4 KB aligned, 4 KB sized table of 1024 CX selectors. When CX access control (§2.7) is supported, each hart has a mcx\_table CSR to address its CX selector table.

A selector index is an integer that identifies an entry in a CX selector table (§2.7).

# 2.2. New CX control / status registers

A -Zicx compatible CPU shall implement the mcx\_selector and cx\_status CSRs for extension multiplexing and custom function instruction execution.

When CX access control (§2.7) is supported, a -Zicx compatible CPU shall implement the mcx\_table and cx\_index CSRs.

All CXU CSR fields marked *reserved* are WPRI (writes preserve values, reads ignore values), and all other fields are WARL (write any values, reads legal values). (An invalid CXU\_ID or STATE\_ID value is still *legal*.)

All CXU CSRs are initialized to zero on reset.

### 2.2.1. mcx\_selector CSR 0xBC0: select active CXU and state context

The mcx\_selector CSR implements composable extension multiplexing. It is assigned various CX selectors over time. This enables or disables CX multiplexing and selects the hart's current CXU and state context (within that CXU). It may only be read or written in machine level.



In a privileged architecture system, user level read access to mcx\_selector values could reveal goings-on in other software threads and thus facilitate side channel attacks.



In a privileged architecture with M/S/U levels, for example, what CSRs are required and what access permissions should they have?



Figure 3. mcx\_selector CSR 0xBC0 (type 0: legacy custom instructions)



Figure 4. mcx\_selector CSR 0xBC0 (type 1: extension multiplexing)

The mcx\_selector CSR has the following fields:

- .type: extension multiplexing type
  - When type=0, disable composable extension multiplexing. The rest of mcx\_selector is ignored. No CXU is selected. Custom-[0123] instructions execute the CPU's built-in custom instructions.
  - When type=1, enable type 1 composable extension multiplexing. The cxu\_id and state\_id fields select the current CXU and state context. Custom-0/-1/-2 instructions issue CXU requests to the CXU identified by cxu\_id and to the state context identified by state\_id.
  - type values 2-15 are reserved.
- .cxu\_id: select the hart's current CXU
  - A valid cxu\_id identifies a configured CXU.
  - When enabled, when cxu\_id does not identify a configured CXU, executing a CF instruction causes an invalid CXU\_ID error. The cx\_status.CI error bit is set and the CF instruction's destination register, if any, is zeroed.
- .state\_id: select the hart's current CXU's current state context
  - A valid state\_id identifies a state context of a CXU.
  - When enabled, when cxu\_id is valid, but state\_id does not identify a state context of the current CXU, executing a CF instruction causes an invalid STATE\_ID error. The cx\_status.SI error bit is set and the CF instruction's destination register, if any, is zeroed.

No error occurs when mcx\_selector is CSR-written with an invalid CX selector, i.e., when .cxu\_id or .state\_id are invalid. Rather, subsequently executing a CF instruction may cause a CXU\_ID or STATE\_ID error.



The hardware that detects these two errors lies not in the extensible processor but in the CXU interconnect (bad .cxu\_id) or in the selected CXU itself (bad .state\_id).



The type field provides backwards compatibility with legacy custom extensions, and forwards compatibility with future CX systems. In future a new CX type may be added, with a new layout and interpretation of selector fields and new means of decoding custom instruction fields into CXU requests.

### 2.2.2. cx\_status CSR 0x801: CX status

The cx\_status CSR accumulates CXU error flags. It may be written and read in all privilege levels.

Typical application software will write a CX selector to mcx\_selector, write 0 to cx\_status, execute some CF instructions, and read cx\_status to determine if there were any errors.



Figure 5. cx\_status CSR 0x801

The cx\_status CSR has the following fields:

- .TP: invalid CX type error
  - Set by a CSR-write to mcx\_selector, or by a CF instruction, when mcx\_selector.type is invalid. (For example, when new software writes a new selector type that old hardware does not implement.)
- .CI: invalid CXU ID error
  - Set by a CF instruction when mcx\_selector.cxu\_id is invalid.
- .SI: invalid STATE\_ID error
  - Set by a CF instruction when mcx\_selector.cxu\_id is valid but mcx\_selector.state\_id is invalid.
- .OF: state context is off error
  - Set by a CF instruction when mcx\_selector.cxu\_id and mcx\_selector.state\_id are valid but the selected state context is in the off state.
- .FI: invalid CF\_ID error
  - Set by a CF instruction when mcx\_selector.cxu\_id and mcx\_selector.state\_id are valid but the instruction's CF\_ID is invalid.
- .OP: CXU operation error
  - Set by a CF instruction when mcx\_selector.cxu\_id, mcx\_selector.state\_id, and its CF\_ID are valid but there is an error in the requested operation or its operands, in lieu of custom error state.
- .CU: custom CXU operation error
  - Set by a CF instruction of a stateful extension when mcx\_selector.cxu\_id, mcx\_selector.state\_id, and its CF\_ID are valid but there is an error in the requested operation or its operands, with custom (extension-defined) error state available.
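
The conditions in the list above can be read as a decision chain. The following Python sketch is one hedged reading of it for type-1 multiplexing. The function name, its parameters, the relative order of the .OF and .FI checks, and the flag bit positions (generated by `auto()`, not taken from Figure 5) are assumptions.

```python
from enum import Flag, auto

class CxStatus(Flag):
    # Names follow the cx_status fields. The bit positions produced by auto()
    # are placeholders and do NOT reflect the CSR layout of Figure 5.
    NONE = 0
    TP = auto()
    CI = auto()
    SI = auto()
    OF = auto()
    FI = auto()
    OP = auto()
    CU = auto()

def classify(cxu_valid, state_valid, context_off, cf_valid, op_error, custom_state):
    """Return the error flag raised by one CF instruction (NONE on success).

    Assumption: .OF is checked before .FI; the text does not order them.
    """
    if not cxu_valid:
        return CxStatus.CI
    if not state_valid:
        return CxStatus.SI
    if context_off:
        return CxStatus.OF
    if not cf_valid:
        return CxStatus.FI
    if op_error:
        # .CU when extension-defined error state is available, else .OP.
        return CxStatus.CU if custom_state else CxStatus.OP
    return CxStatus.NONE
```

Because cx\_status accumulates flags, software modeling several CF instructions would OR the results together, for example `status |= classify(...)`, and inspect `status` once at the end.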



The custom error state of a stateful extension may be obtained using custom functions of the extension. In addition, the custom error state of a serializable extension may also be obtained using IStateContext custom functions cf\_read\_status and/or cf\_read\_state.



Should writing mcx\_selector automatically zero cx\_status? This shortens the code path to use an extension by one instruction but it precludes the use case of clearing errors, issuing a series of custom function instructions across multiple extensions, then checking for errors.

For simplicity we do not adopt this option.



How best to anticipate future changes to cx\_status? One option: fields and behavior are determined by the hart's current CX type (mcx\_selector.type). This becomes unwieldy when multiplexing switches between extensions of different types. Another option: add a cx\_status.type field, selecting an interpretation of the cx\_status CSR fields. Both options may lead to unnecessarily complicated error handling in software. Best option: only add new fields to it. Here simplest seems best.

### 2.2.3. mcx\_table CSR 0xBC1: CX selector table base

When CX access control (§2.7) is supported, the MXLEN-bit-wide mcx\_table CSR specifies the base address of the hart's CX selector table. The CSR may be read and written in machine level.



Figure 6. mcx\_table CSR 0xBC1 (when MXLEN=32)

CSR-writes to mcx\_table zero the twelve least significant bits of the table address, so a CX selector table address must be 4 KB aligned.

### 2.2.4. cx\_index CSR 0x800: CX selector index

When CX access control (§2.7) is supported, the cx\_index CSR selects an entry of the hart's CX selector table to write to the mcx\_selector CSR. The CSR may be read and written in all privilege levels.



Figure 7. cx\_index CSR 0x800

The 10-bit zero-extended index field specifies which entry in the hart's CX selector table (at the hart's mcx\_table) to use as the hart's current CX selector.

In response to CSR-write of cx\_index, load the 32-bit CX selector at address (mcx\_table + cx\_index.index\*4) and CSR-write the CX selector to mcx\_selector, performing the load and the CSR-write at the next higher privilege level, as if it were a lw instruction (and with a lw instruction's memory ordering rules) (§2.7).
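
The address arithmetic for the CX selector table can be sketched as follows. This is a minimal Python model, not normative: the function and constant names are hypothetical, and masking the index to 10 bits is an assumption about how out-of-range writes to the index field behave.

```python
TABLE_BYTES = 4096                               # 4 KB CX selector table
SELECTOR_BYTES = 4                               # each CX selector is 32 bits
NUM_SELECTORS = TABLE_BYTES // SELECTOR_BYTES    # 1024 entries

def write_mcx_table(value, mxlen=32):
    """Model a CSR-write to mcx_table: the twelve low address bits are zeroed."""
    return value & ((1 << mxlen) - 1) & ~(TABLE_BYTES - 1)

def selector_address(mcx_table, index):
    """Address of the CX selector loaded when cx_index.index is written."""
    index &= NUM_SELECTORS - 1   # assumption: only the 10-bit index field is used
    return mcx_table + index * SELECTOR_BYTES
```

For example, with the table based at 0x80001000, index 1023 addresses the last selector of the table at 0x80001FFC.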

# 2.2.5. Implicit CXU CSR fences

Per hart, there is an implicit fence between any CXU CSR access and any series of custom-0/-1/-2 instructions. All CXU CSR accesses happen before any CF instructions which follow, and all CF instructions happen before any CXU CSR accesses that follow.



For example, after issuing a long latency CF instruction, a CSR read of cx\_status must await the CF instruction's CXU response.

# 2.3. Custom function instruction encodings

When mcx\_selector.type=1, software issues CF instructions to the current state context of the current extension (i.e., of the current configured CXU) using R-type, I-type, and flex-type custom function instruction encodings.

For each instruction encoding, the CF instruction specifies the CF\_ID, and source operand values, which may be two source registers, or one source register and one immediate value. R-type and I-type instructions always write a destination register whereas flex-type instructions never do so.

### 2.3.1. Custom-0 R-type encoding

Assembly instruction: cx\_reg cf\_id,rd,rs1,rs2

An R-type CF instruction issues a CXU request for a zero-extended 10-bit CF\_ID cf\_id with two source register operands identified by rs1 and rs2. The CXU response data is written to destination register rd.



Figure 8. CX R-type instruction encoding

### 2.3.2. Custom-1 I-type encoding

Assembly instruction: cx\_imm cf\_id,rd,rs1,imm

An I-type CF instruction issues a CXU request for a zero-extended 4-bit CF\_ID cf\_id with one source register operand identified by rs1 and a sign-extended 8-bit immediate value imm. The CXU response is written to destination register rd.
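
The operand widths above can be illustrated with a short Python sketch. It takes the already-extracted cf\_id and imm field values as inputs, since the bit positions of those fields within the instruction are given only by the encoding figure. The function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def itype_operands(cf_id_field, imm_field, xlen=32):
    """Return (CF_ID, second operand) for a custom-1 I-type CF instruction."""
    cf_id = cf_id_field & 0xF                            # zero-extended 4-bit CF_ID
    imm = sign_extend(imm_field, 8) & ((1 << xlen) - 1)  # XLEN-bit operand image
    return cf_id, imm
```

An immediate field of 0xFF therefore reaches the CXU as an all-ones XLEN-bit operand (the value -1), while 0x7F arrives as 127.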



Figure 9. CX I-type instruction encoding



This new, irregular immediate field encoding may have a disproportionate impact on area and critical path delay in the decode or execute pipeline stages of a RISC-V processor core.

Seven-eighths of the custom-1 encoding space is reserved for future custom function instruction encodings.



Figure 10. CX reserved I-type instruction encodings

# 2.3.3. Custom-2 flex-type encoding

Assembly instruction: cx\_flex cf\_id,rs1,rs2
Assembly instruction: cx\_flex25 custom

A flex-type CF instruction issues a CXU request for a zero-extended 10-bit CF\_ID cf\_id with two source register operands identified by rs1 and rs2. There is no destination register and CXU response data (but not a possible error status) is discarded. The instruction is executed purely for its effect upon the selected state context of the selected CXU.



*Figure 11. CX flex-type instruction encoding* 

Alternatively, equivalently, the cx\_flex25 form of instruction issues an arbitrary 25-bit custom instruction.



*Figure 12. CX flex-type instruction alternate encoding* 



A flex-type CF instruction may be used with a CXU-L2 request's raw instruction field req\_insn (§3.4.5) to provide an arbitrary 32-7=25-bit custom request to a CXU. The absence of an (integer) destination register field is a feature that provides added, CPU-uninterpreted, custom instruction bits to a CXU.



One disadvantage of this approach: when the selected CXU routinely discards the R[rs1] or R[rs2] operands, a flex-type custom function instruction still creates a false dependency on the rs1 and rs2 registers, which may needlessly delay issue of the CF instruction in an out-of-order CPU core.

# 2.4. Custom function instruction execution via composable extension multiplexing

Figure 13 illustrates how a custom function instruction and the CXU CSRs implement composable extension / CXU composition via composable extension multiplexing. When the CPU issues a custom function instruction, it produces a CXU request from the fields of the instruction, two source operands from the register file and/or an immediate field of the instruction, and the cxu\_id and state\_id fields of mcx\_selector. The CXU request may include the request ID cookie (defined by the CPU), the CXU\_ID, STATE\_ID, raw instruction, CF\_ID, and operands. The CXU\_ID identifies which CXU must process the request. The CXU includes state context(s) and a datapath. The STATE\_ID selects the state context to use for this request. The CXU checks for errors in CXU\_ID, STATE\_ID, and CF\_ID per 2.2.2, processes the request, possibly updating this state context, and produces a CXU response, which may include the same request ID cookie, a success/error status, and the response data. The CPU commits the custom function instruction by updating cx\_status (when response status is an error condition) and writing the response data to the destination register.



Figure 13. HW-SW interface: flow of information for execution of a custom function instruction

Multiple custom function instructions may be in flight at the same time, particularly in a system with pipelined CPUs or pipelined CXUs. A CPU may send a request ID and later receive the (same) ID back to correlate requests sent and responses received.

Table 1 defines the mapping from HW-SW interface entities, such as the cf\_id, rd, rs1, rs2, imm fields of the custom function instruction and the mcx\_selector and cx\_status CSRs, to the CXU Logic Interface's request and response signals (§3.4).

Table 1. Mapping of HW-SW interface entities to CXU-LI signals

| CXU-LI signal | ← Source or → Destination                          |
|---------------|----------------------------------------------------|
| req_id        | ← CPU                                              |
| req_cxu       | ← mcx_selector.cxu_id                              |
| req_state     | ← mcx_selector.state_id                            |
| req_insn      | ← insn                                             |
| req_func      | ← insn.cf_id                                       |
| req_data0     | ← R[insn.rs1]                                      |
| req_data1     | ← R[insn.rs2] {custom-0/-2} or insn.imm {custom-1} |
| resp_id       | → CPU                                              |
| resp_status   | → cx_status bits                                   |
| resp_data     | → R[insn.rd] {custom-0/-1}                         |
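
The request half of Table 1 can be sketched in Python as below. This is an illustrative model only: the `CxuRequest` class, the `make_request` helper, and its calling convention are assumptions, and instruction field extraction is taken as already done.

```python
from dataclasses import dataclass

@dataclass
class CxuRequest:
    req_id: int      # request ID cookie chosen by the CPU
    req_cxu: int     # from mcx_selector.cxu_id
    req_state: int   # from mcx_selector.state_id
    req_insn: int    # raw 32-bit instruction
    req_func: int    # from insn.cf_id
    req_data0: int   # R[insn.rs1]
    req_data1: int   # R[insn.rs2] (custom-0/-2) or insn.imm (custom-1)

def make_request(req_id, cxu_id, state_id, insn, cf_id, regs, rs1, rs2=None, imm=None):
    """Form a CXU request per Table 1.

    Pass rs2 for custom-0/-2 instructions, or imm for custom-1 instructions.
    """
    if (rs2 is None) == (imm is None):
        raise ValueError("supply exactly one of rs2 or imm")
    data1 = regs[rs2] if rs2 is not None else imm
    return CxuRequest(req_id, cxu_id, state_id, insn, cf_id, regs[rs1], data1)
```

The response half is the mirror image: resp\_id returns the cookie to the CPU, resp\_status updates cx\_status, and resp\_data is written to R[insn.rd] for custom-0/-1 instructions.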

# 2.4.1. Precise exceptions

Custom function instruction execution preserves precise exception semantics. If an instruction preceding (in execution order) a custom function instruction raises an exception, the custom function instruction does not execute, and has no effect upon architected state, including the cx\_status CSR, and no effect on the current state context of the composable extension / CXU.

If an instruction following (in execution order) a custom function instruction raises an exception, the custom function instruction executes, updating destination register, cx\_status, and current state context, as appropriate.

A CPU may speculatively issue a CF instruction to a stateless CXU. Misspeculation recovery entails completing and discarding the CXU response. The CF instruction does not commit and there is no change to architectural state.

A CPU may not speculatively issue a CF instruction to a stateful CXU because the instruction may update the current state context and the CXU Logic Interface has no means to cancel a CXU request. In other words, a CF instruction of a stateful CXU, once issued, always commits.

Speculation is more than branch prediction. For example, in a pipelined CPU, instructions that follow a load or store instruction typically issue speculatively until the load or store is determined to not raise an access fault. CF instructions of stateful CXUs must not issue in the wake of an instruction that may yet trap.

When a long latency CF instruction issues and a pipelined CPU continues issuing the following instructions in its wake, and one traps, the CPU nevertheless commits the CF instruction when the CXU eventually sends the response.

How can a CPU core determine dynamically whether a CF instruction, or its composable extension, is stateless?



A software-defined approach could decorate the specification of a custom function to indicate whether it is stateful or stateless, and to encode this as an opcode bit in the custom-0/-1/-2 instructions. Then a CPU may safely speculatively issue stateless CF instructions but non-speculatively issue stateful CF instructions.

A hardware-defined approach could add to the request and response streams defined in CXU-LI, a third stream, called the commit stream. This enables a CPU to speculatively issue any CF instruction and issue its CXU request, then later, when speculation is resolved, issue its commit token or cancel token. A stateful CXU, receiving and performing a CXU request, would defer updating any CXU state until the request's corresponding commit token arrives.
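
To make the hardware-defined option more concrete, here is a minimal Python sketch of a stateful CXU that buffers updates until a token arrives. It is purely illustrative: CXU-LI as specified has no commit stream, and the class name, method names, and the representation of a request as a single word write are all hypothetical.

```python
class DeferredCommitCxu:
    """Stateful CXU model that defers state updates until a commit token arrives."""

    def __init__(self):
        self.state = {}     # committed state context: word index -> value
        self.pending = {}   # request ID -> (word index, value) awaiting a token

    def request(self, req_id, index, value):
        """Perform a speculative write request without touching committed state."""
        self.pending[req_id] = (index, value)

    def commit(self, req_id):
        """Commit token: apply the buffered update to the state context."""
        index, value = self.pending.pop(req_id)
        self.state[index] = value

    def cancel(self, req_id):
        """Cancel token: discard the buffered update."""
        self.pending.pop(req_id)
```

The cost of this design is the buffering: the CXU must hold one pending update per in-flight speculative request.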

# 2.5. IStateContext: the standard custom functions

The IStateContext composable extension defines four standard custom functions to manage extension state context data. Stateful custom extensions should (albeit not *must*) inherit from this extension, i.e., incorporate these four custom functions. IStateContext provides a standard, uniform way to access the extension's custom error state and enables an extension-agnostic runtime or operating system to reset, save, and reload state contexts.

Table 2. Standard stateful custom functions

| Custom function | CF_ID | Assembly instruction   | Encoding               |
|-----------------|-------|------------------------|------------------------|
| cf_read_status  | 1023  | cx_read_status rd      | cx_reg 1023,rd,x0,x0   |
| cf_write_status | 1022  | cx_write_status rs1    | cx_reg 1022,x0,rs1,x0  |
| cf_read_state   | 1021  | cx_read_state rd,rs1   | cx_reg 1021,rd,rs1,x0  |
| cf_write_state  | 1020  | cx_write_state rs1,rs2 | cx_reg 1020,x0,rs1,rs2 |

CF\_IDs 1008-1023 (0x3F0-0x3FF) are reserved for standard custom functions. It is recommended, not mandatory, that these CF\_IDs not be used for another purpose.

Any CF instruction with CF\_ID=1023 must be side effect free, i.e., never modify any CXU state.

### 2.5.1. Interface state context status word

The cf\_read\_status and cf\_write\_status functions access the selected extension state context's status word.



Figure 14. CXU state context status word

The extension state context status word has the following fields:

### .cs: context status

- The state context has four context status values: { 0: off; 1: initial; 2: clean; 3: dirty } which correspond to those of the XS field of the mstatus CSR, per the RISC-V Privileged ISA specification (Waterman et al., 2021, p. 26).
- On system reset, each state context of a serializable stateful extension CXU is in the initial state.
- A write .cs=0 has the side effect of explicitly turning off the *current* state context. In this state, all CF instructions except cf\_write\_status and cf\_read\_status signal CXU\_ERROR\_OFF, until the state context status is set to another state by a subsequent cf\_write\_status.
- A write .cs=1 has the side effect of resetting the entire *current* state context to its initial (power up) state.
- When a CF instruction modifies any aspect of the current state context of a serializable CXU, its state context status automatically changes to dirty.
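
The .cs transitions listed above can be sketched as a small state machine. This Python model is an illustration under assumptions: the class and method names are hypothetical, the initial (power up) state is assumed to be all zeros, and `execute` stands in for any stateful CF instruction other than cf\_read\_status and cf\_write\_status.

```python
OFF, INITIAL, CLEAN, DIRTY = 0, 1, 2, 3   # .cs encodings listed above

class StateContext:
    """One state context of a serializable CXU holding `size` words of state."""

    def __init__(self, size):
        self.size = size
        self.cs = INITIAL          # on system reset, contexts are initial
        self.words = [0] * size    # assumption: initial state is all zeros

    def write_cs(self, cs):
        """Model the .cs side effects of cf_write_status."""
        if cs == INITIAL:
            self.words = [0] * self.size   # reset to initial (power up) state
        self.cs = cs

    def execute(self, index, value):
        """Model some other stateful CF that modifies one word of state.

        Returns False, standing for CXU_ERROR_OFF, when the context is off.
        """
        if self.cs == OFF:
            return False
        self.words[index] = value
        self.cs = DIRTY            # any modification marks the context dirty
        return True
```

A resource manager can use the dirty status to skip saving contexts that have not been modified since they were last saved or reset.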

### .state\_size: state context size

- This WARL field specifies the *current* size (number of XLEN-sized words) of the current state context.
- Reads return the current size of the current state context.
- The value read need not equal the last value written.
- Writes return the previous size and cs status of the current state context.
- Different CXU implementations of the same composable extension may have different state context sizes.
- Different state contexts of the same CXU may have different state context sizes.
- At different times, the same state context of the same CXU may have different state context sizes.

### .error: custom error status

- An 8-bit custom error status for the current extension / CXU and its state context.



Define rules for what the extension can or must do with writes to this field. Need a way to zero a custom error. But this is not a free byte of storage per state context. An implementation is permitted to implement this as constant 0, for example.

### 2.5.2. cx\_read\_status standard custom function instruction

Assembly instruction: cx\_read\_status rd

This instruction retrieves the state context status word (§2.5.1) of the selected state context of the selected CXU and writes it to the rd destination register.

cx\_read\_status can never modify the selected state context, nor modify the behavior of the extension.

The status word .state\_size field may change as a side effect of executing a stateful CF instruction.

For the CF instruction sequence [cx\_read\_status; cx\_read\_state\*; cx\_read\_status], the first and second cx\_read\_status must return the same .state\_size.

For the CF instruction sequence [cx\_read\_status, any-other-CF-instruction\*, cx\_read\_status], the first and second cx\_read\_status need not return the same .state\_size.



For most stateful CXUs, the size of a state context is fixed. For some stateful CXUs, the size of a state context may depend upon the sequence of CF instructions performed. For example, a stateful vector math CXU may provide CF instructions to allocate per-state context vector storage from a common, private shared pool, and may allow different state contexts to represent different sized vectors.

cx\_read\_status may be used as a *probe* after a mcx\_selector write, to check whether the selector addresses a valid CXU and state context:

```
csrw mcx_selector,x1  ; select some CXU and state context
csrw cx_status,x0  ; clear cx_status
cx_read_status x0  ; probe, discarding state status word
csrr x2,cx_status  ; retrieve cx_status
...  ; cx_status.ci => invalid CXU_ID
...  ; cx_status.si => invalid STATE_ID
```

### 2.5.3. cx\_write\_status standard custom function instruction

Assembly instruction: cx\_write\_status rs1

This instruction writes the value of the rs1 source register to the state status word of the selected state context of the selected CXU, and writes the previous value of the state context status word to the rd destination register.

A write .cs=1 always has the side effect of resetting the selected state context to its initial (power up) state.

For the sequence [cx\_write\_status; \*; cx\_read\_status] the value of .state\_size read need not equal the last value written.

A cx\_write\_status CF instruction never has any effect upon any other state context of the CXU, or of any other CXU.

### 2.5.4. cx\_read\_state standard custom function instruction

Assembly instruction: cx\_read\_state rd,rs1

This instruction reads one (XLEN-bit) word of state, at the index specified by the **rs1** source register, from the selected state context of the selected CXU, and writes it to the **rd** destination register.

### 2.5.5. cx\_write\_state standard custom function instruction

Assembly instruction: cx\_write\_state rs1,rs2

This instruction reads the value of the rs2 source register and writes it to the selected state context of the selected CXU at the index specified by the value of the rs1 source register. It also writes the value of the rs2 source register to the rd destination register. It silently drops attempts to write state at an invalid state index.

# 2.6. Resource management and context switching

A software resource manager (e.g., thread pool, language runtime, language virtual machine, RTOS, operating system, hypervisor) multiplexes software loci of execution (e.g., request, worker, actor, activity, task, fiber, continuation, thread, process), *locus* for short, upon one or more hardware threads (*harts*).

The RISC-V per-hart state includes the program counter and integer register file, and optionally, floating point and vector register files, and various CSRs. Composable extensions extension -Zicx extends per-hart state with the CXU CSRs (§2.2) and the various configured state contexts of the stateful configured composable extensions.

A CXU implementing a stateful composable extension is typically configured with one state context per hart in the entire system, but other configurations, including one context per locus, or a small pool of cooperatively or preemptively managed contexts, or several harts sharing one context, or one singleton context, are possible. Similarly, each CXU in a system may be configured with a different number of its state contexts.

The resource manager maintains the mapping of loci to harts, and the mapping of harts to (per-CXU) state contexts. The resource manager consults a *system CXU map* specifying the CXU\_IDs of the configured extensions of the system and, for each extension/CXU, the number of state contexts it is configured with. A stateless CXU has zero contexts.

Over time, the resource manager must reset, save, and restore hart state, including its extension state contexts, to initialize a hart or to perform a context switch.

To reset hart state, for each extension state context of the hart, execute

```
li a1,{.error=0,.cs=1/*initialize*/}
lw a0,selectors[i]
csrw mcx_selector,a0
cx_write_status a1
```

This resets that state context to its initial state. It is also necessary to reset cx\_status.

```
csrw cx_status,x0
```

To save hart state, first save cx\_status, then for each extension state context of the hart, execute

```
csrr a0,cx_status
sw a0,saved_cx_status
...
lw a0,selectors[i]
csrw mcx_selector,a0
cx_read_status a0
sw a0,status[i]
```

to obtain .state\_size, the size (in XLEN-bit words) of the serialized state context for the selected state context. Allocate array save[i][] to store the serialized state context. For each word index j below .state\_size, execute

```
cx_read_state a0,j
sw/sd a0, save[i][j]
```

(When XLEN=32, use sw; when XLEN=64, use sd.)

To restore hart state, for each extension state context of the hart, first execute

```
lw a0, selectors[i]
csrw mcx_selector, a0
lw a0, status[i]
cx_write_status a0
```

to restore the state context status word. Then for each word in status[i].state\_size, execute

```
lw/ld a0, save[i][j]
cx_write_state j,a0
```

to restore each word of the state context. Finally restore the saved cx\_status.

```
lw a0,saved_cx_status
csrw cx_status,a0
```
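The save and restore sequences above can be sketched as a small software model. This is a minimal sketch, not the specification's reference code: `ModelContext`, `save_hart`, `restore_hart`, and the `state_size_of` callback are hypothetical names, and the layout of the status word is deliberately left to the caller because it is not defined in this section.

```python
from dataclasses import dataclass, field


@dataclass
class ModelContext:
    """Hypothetical stand-in for one CXU state context (not a real CXU)."""
    status: int = 0
    words: list = field(default_factory=list)

    # Each method mirrors one CX instruction used in the sequences above.
    def cx_read_status(self):
        return self.status

    def cx_write_status(self, value):
        self.status = value

    def cx_read_state(self, j):
        return self.words[j]

    def cx_write_state(self, j, value):
        self.words[j] = value


def save_hart(contexts, cx_status, state_size_of):
    """Save cx_status first, then each context's status word and state words."""
    saved = {"cx_status": cx_status, "status": [], "save": []}
    for ctx in contexts:                       # one iteration per selectors[i]
        status = ctx.cx_read_status()
        n = state_size_of(status)              # caller-supplied .state_size decode
        saved["status"].append(status)
        saved["save"].append([ctx.cx_read_state(j) for j in range(n)])
    return saved


def restore_hart(contexts, saved):
    """Restore each context's status word, then its state words.

    Returns the saved cx_status, which the caller writes back last.
    """
    for ctx, status, words in zip(contexts, saved["status"], saved["save"]):
        ctx.cx_write_status(status)
        for j, word in enumerate(words):
            ctx.cx_write_state(j, word)
    return saved["cx_status"]
```

The model keeps the ordering used in the text: cx\_status is captured before any per-context work on save, and is handed back for restoration only after every context has been rewritten.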

When different CXUs implement the same composable extension, they may have different serializations, of different sizes.



Discuss preemption scenario where following context save, later restore, the locus moves to a different STATE\_ID of a CXU. cx\_index may (but should not) change. However, resource manager must change mcx\_selector.



cx\_read\_state and cx\_write\_state are random access. It is possible this induces unnecessary CXU hardware area. Perhaps specify a stream-out/stream-in extension instead.



Discuss impact of mixed sized serialized contexts upon system code and upon CXU design. Can a serialized state context ever be too big to reload?



Is it necessary or helpful for CXU metadata to declare fixed- or variable-sized extension state contexts?

### 2.7. CX access control

Fully trusted software, executing in machine level, has full access to every CXU and every state context. Software may write an arbitrary CX selector value to the mcx\_selector CSR, addressing any CXU and any state context. This is sufficient to implement composable extension multiplexing but does not provide means to protect one hart's CXUs' state from another hart, nor to limit a hart's access to a given CXU.

When a CPU implements user level and machine level privileged architecture, an attempt to CSR-write mcx\_selector from user level generates an illegal instruction exception.

Machine level software may provide to user level software an ECALL function to change mcx\_selector.

Alternatively, the machine level illegal instruction exception handler can determine whether the new CX selector value is valid for the user level code executing on the hart, optionally perform the CSR-write on its behalf, and return from exception.

Whether ECALL or exception handler, a detour into system level is prohibitively slow: reconfiguring composable extension multiplexing should take, at most, a few clock cycles.

The optional CX access control CSRs mcx\_table and cx\_index allow less privileged user code to rapidly multiplex composable extensions, but only among those extensions and state contexts to which it has been granted access by more privileged system code.

CX access control requires at least user level and machine level privileged architecture, and a memory access control system, i.e., either RISC-V PMP or RISC-V virtual memory access control.

For each hart, the system code provisions a *CX selector table*, 4 KB aligned, comprising 1024 32-bit CX selectors, which is read/write to system code and inaccessible from user code. Initially the table is zero filled, as zero is a valid CX selector ( .en=0 which disables composable extension multiplexing). The system code CSR-writes its address to the hart's mcx\_table CSR. Then in response to a system call requesting access to an extension, and one of its state contexts, system code determines whether the access is granted. If so, it determines the CX selector value for it, allocates an entry for that CX selector value in the CX selector table, and returns the index (the *selector index*) of that entry to user code.



This index is analogous to a Unix file descriptor — an opaque token to a resource granted by system code.

To select this CX/CXU and its state, user code CSR-writes its index to cx\_index. In response, the CPU loads from memory (at more privileged level) the CX selector word at that index in the selector table and CSR-writes it to mcx\_selector — no exception handling detour required.



This mechanism also conceals the specific CXU\_ID and STATE\_ID information from user code, precluding some possible side channel attacks.
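The selector table mechanism can be modeled in a few lines. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the allocation policy (treat a zero entry as free, never hand out index 0) is a modeling choice and not something this section specifies, and the rejection of out-of-range indices is likewise only a property of the model.

```python
TABLE_ENTRIES = 1024          # 4 KB table of 32-bit CX selectors


class HartCxModel:
    """Hypothetical model of one hart's CX selector table and CSRs."""

    def __init__(self):
        self.table = [0] * TABLE_ENTRIES   # zero filled: every entry has .en=0
        self.mcx_selector = 0
        self.cx_index = 0

    def grant(self, selector):
        """System code: store a selector in a free entry, return its index.

        Assumption: a zero entry is free, and index 0 is kept as the
        'multiplexing disabled' entry.
        """
        for index in range(1, TABLE_ENTRIES):
            if self.table[index] == 0:
                self.table[index] = selector & 0xFFFFFFFF
                return index
        raise RuntimeError("CX selector table is full")

    def write_cx_index(self, index):
        """User code CSR-write to cx_index.

        The CPU loads the selector word at that index and writes it to
        mcx_selector, with no exception handling detour.
        """
        if not 0 <= index < TABLE_ENTRIES:
            raise ValueError("selector index out of range (model-only check)")
        self.cx_index = index
        self.mcx_selector = self.table[index]
```

User code only ever sees the small integer returned by `grant`; the selector value itself, and therefore the CXU\_ID and STATE\_ID it encodes, stays in memory that user code cannot read.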

# 3. Composable Extension Unit Logic Interface

The CXU-LI defines a set of common hardware logic signaling extensions enabling straightforward, correct composition of CPUs and CXUs. In the CXU-LI, a CPU is a requester and a CXU is a responder. The CPU sends a CXU request and eventually receives a CXU response. For each request there is exactly one response.

### 3.1. Definitions

A CXU request (request) is a group of CXU-LI signals that may include request flow control, REQ\_ID, CXU\_ID, CF\_ID, STATE\_ID, the raw instruction, and integer operands, produced by a CXU requester, conveying request data to a CXU.

A CXU response (response) is a group of CXU-LI signals that may include response flow control, REQ\_ID, response status, and integer result, produced by a CXU, conveying response data to a CXU requester.

A request ID (REQ\_ID) is a tag (a magic cookie) that correlates a CXU request and its corresponding CXU response.

A CXU response status (response status, status) is a CXU-LI success/error code produced by a CXU in response to receiving a CXU request, indicating success or else an error in the request's CXU\_ID, CF\_ID, STATE\_ID, operation, or a composable extension specific error.

A CXU requester (requester) is a core that sends CXU requests to CXU(s) and receives CXU response(s) from CXUs.

A **CPU** is a CXU requester that implements RISC-V RV-I-Zicsr-Zicx instruction set, issues CXU requests upon issuing CF instructions, and writes a destination register and the CXU status CSR in response to CXU responses.

A composable extension unit (CXU, responder) is a core that implements one or more composable extensions. It receives CXU requests and sends CXU responses to CXU requesters. A CXU that also issues CXU requests is an intermediary CXU; otherwise it is a leaf CXU.

A Switch CXU (switch) is an intermediary CXU. For each request received, the switch either sends a response itself (e.g., a CXU\_ERROR\_CXU response) or arbitrates and forwards the request to a subordinate CXU, and later forwards the corresponding response to the original requester.

A CXU feature level adapter (adapter) is an intermediary CXU that receives requests and sends responses at one CXU-LI feature level and adapts them for and forwards them to a subordinate CXU with a lesser feature level.

A **configured system (system)** is a computer system including one or more CPUs and zero or more CXUs that implement a set of configured composable extensions.

### 3.2. Example configured system

Figure 15 illustrates a configured system composed of two CPUs and five CXUs, plus two switches and a level adapter for $CXU_3$. Each CPU has two harts. CXUs 0-2 are stateful and CXUs 3-4 are stateless. Each stateful CXU has one state context per hart. $CXU_1$ has an additional state context per hart for isolated stateful requests from $CXU_2$.



Figure 15. Configured system composed of two CPUs and five CXUs

In general, a CPU that issues one CXU request per cycle is directly coupled to one CXU, usually a switch CXU. A system of CXUs forms a directed acyclic graph.

### 3.3. CXU-LI feature levels

The CXU-LI is stratified into separate feature levels: -L0: combinational; -L1: fixed latency; -L2: variable latency; and -L3: reordering. Each feature level adds yet more CXU request and response signals, module ports, and behaviors to the feature level below it.

Stratification keeps simple use cases simple and frugal, and makes more complex use cases possible.

### 3.3.1. CXU-L0: combinational CXU

The CXU, which implements a stateless composable extension, computes a combinational function of the CXU request, sending a CXU response after some propagation delay. There is no flow control.

Example: combinational bitmanip unit with a population count custom function.

### 3.3.2. CXU-L1: fixed latency CXU

Each cycle, the CXU computes a function of the CXU request and the specified state context, if any, updating the context, sending a CXU response after a configured fixed non-negative number of clock cycles. With an initiation interval of II=1/cycle, there is no flow control of requests or responses.



Examples: stateless: a pipelined multiplier; stateful: a pipelined multiply-accumulate unit wherein the state is the current total.



Perhaps minimum II should also be configurable, e.g. CXU\_INIT\_INTERVAL=1+.

### 3.3.3. CXU-L2: variable latency CXU

The CXU computes a function of the CXU request and the specified state context, if any, updating the context, sending a CXU response, in order, in a later clock cycle. There is request and response flow control so the CXU can suspend receiving requests and the CPU can suspend receiving responses.



Example: a multiply-divide unit with a variable-latency multi-cycle divide, with early-out.

### 3.3.4. CXU-L3: reordering CXU

The CXU computes a function of the CXU request and the specified state context, if any, updating the context, and sending a CXU response in a later clock cycle. Responses for requests with the same state context are sent in order, otherwise may be sent out of order. There is request and response flow control.

CXU-L3 incorporates a request-response ID for the requester to correlate responses received to requests sent.



Example: a stateless, variable latency posit floating point unit, which, having received a pdiv request then a pmul request, responds out of order, sending the pmul response ahead of the pdiv response.

### 3.3.5. Feature levels summary

In summary, all CXU-LI feature levels have request and response function, data, and status. Level 0 is combinational. Level 1 adds clocking, fixed latency, and state contexts. Level 2 adds variable latency, request and response flow control, and the raw instruction. Level 3 adds reordering. (Table 3.)

Table 3. CXU-LI feature levels summary

| Level | CXU type         | Req valid, func, data, resp data, status | Clock, reset, clock enable,<br>state ID, resp valid | Req ready, resp<br>ready, raw insn | Reordering,<br>req ID |
|-------|------------------|------------------------------------------|-----------------------------------------------------|------------------------------------|-----------------------|
| 0     | combinational    | Y                                        |                                                     |                                    |                       |
| 1     | fixed latency    | Y                                        | Y                                                   |                                    |                       |
| 2     | variable latency | Y                                        | Y                                                   | Y                                  |                       |
| 3     | reordering       | Y                                        | Y                                                   | Y                                  | Y                     |



Compared to all possible subsets of features, CXU-LI levels are relatively simple and practical. Each level is a superset of lower levels, simplifying composition of dissimilar CXUs using common CXU feature level adapters.

### 3.4. CXU-LI signaling

CXU cores of a particular feature level implement a common set of request and response signals. Table 4 lists all CXU-LI signals of all feature levels in a canonical order: transaction signals (request/response valid, ready, REQ\_ID), context (CXU\_ID, STATE\_ID), function (raw instruction, CF\_ID), and data. The Level column indicates which levels introduce which signals. The Dir column indicates the signal direction from the perspective of a responder. The bit width of each bit vector is determined by a width parameter, configurable per CXU (§3.4.1).

Table 4. All CXU-LI signals, by feature level

| Level | Dir | Port        | Width Parameter | Description             |
|-------|-----|-------------|-----------------|-------------------------|
| 1+    | in  | clk         |                 | clock                   |
| 1+    | in  | rst         |                 | reset                   |
| 1+    | in  | clk_en      |                 | clock enable            |
|       | in  | req_valid   |                 | request valid           |
| 2+    | out | req_ready   |                 | request ready           |
| 3     | in  | req_id      | CXU_REQ_ID_W    | request REQ_ID          |
|       | in  | req_cxu     | CXU_CXU_ID_W    | request CXU_ID          |
| 1+    | in  | req_state   | CXU_STATE_ID_W  | request STATE_ID        |
|       | in  | req_func    | CXU_FUNC_ID_W   | request CF_ID           |
| 2+    | in  | req_insn    | CXU_INSN_W      | request raw instruction |
|       | in  | req_data0   | CXU_DATA_W      | request operand data 0  |
|       | in  | req_data1   | CXU_DATA_W      | request operand data 1  |
| 1+    | out | resp_valid  |                 | response valid          |
| 2+    | in  | resp_ready  |                 | response ready          |
| 3     | out | resp_id     | CXU_REQ_ID_W    | response ID             |
|       | out | resp_status | CXU_STATUS_W    | response status         |
|       | out | resp_data   | CXU_DATA_W      | response data           |

All signals are positive-true logic.
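To make the "each level is a superset of lower levels" property concrete, Table 4 can be transcribed into a small lookup. This is an illustrative sketch only: the `CXU_LI_PORTS` list and `ports_for_level` function are hypothetical names, and an empty Level cell in Table 4 is read here as "present at every level".

```python
# (minimum feature level, direction, port), transcribed from Table 4
# in its canonical order. An empty Level cell is treated as level 0.
CXU_LI_PORTS = [
    (1, "in", "clk"),
    (1, "in", "rst"),
    (1, "in", "clk_en"),
    (0, "in", "req_valid"),
    (2, "out", "req_ready"),
    (3, "in", "req_id"),
    (0, "in", "req_cxu"),
    (1, "in", "req_state"),
    (0, "in", "req_func"),
    (2, "in", "req_insn"),
    (0, "in", "req_data0"),
    (0, "in", "req_data1"),
    (1, "out", "resp_valid"),
    (2, "in", "resp_ready"),
    (3, "out", "resp_id"),
    (0, "out", "resp_status"),
    (0, "out", "resp_data"),
]


def ports_for_level(level):
    """Return the port names a CXU of the given feature level implements."""
    if level not in (0, 1, 2, 3):
        raise ValueError("CXU-LI feature level must be 0, 1, 2, or 3")
    return [port for min_level, _direction, port in CXU_LI_PORTS
            if min_level <= level]
```

Calling `ports_for_level(0)` and `ports_for_level(1)` reproduces the port lists of the per-level signal tables (Table 9 and Table 11).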



It is unfortunate the custom function ID is CF\_ID in the HW-SW interface and FUNC\_ID in CXU-LI.

### 3.4.1. CXU-LI configuration parameters

Table 5 presents CXU-LI bit vector width parameters and ranges of possible values.

*Table 5. CXU-LI width configuration parameters* 

| Level | Quantity | Width Parameter | Range  | Default | Description                 |
|-------|----------|-----------------|--------|---------|-----------------------------|
| 3     | REQ_ID   | CXU_REQ_ID_W    | 0-64   | 0       | request/response ID width   |
|       | CXU_ID   | CXU_CXU_ID_W    | 0-16   | 0       | CXU_ID width                |
| 1+    | STATE_ID | CXU_STATE_ID_W  | 0-16   | 0       | STATE_ID width              |
|       | CF_ID    | CXU_FUNC_ID_W   | 0-10   | 10      | CF_ID width                 |
| 2+    | insn     | CXU_INSN_W      | 0, 32  | 0       | raw instruction width       |
|       | data     | CXU_DATA_W      | 32, 64 | 32      | request/response data width |
|       | status   | CXU_STATUS_W    | 3      | 3       | response status width       |



Zero width bit vectors are problematic in some HDLs. Parameter signals declared 0-bits wide should nevertheless be declared [0:0], driven 1'b0 by sender, and ignored by receiver.



When CXU\_FUNC\_ID\_W<10, how do standard custom functions (CF\_ID in [0x3F0..0x3FF]) work?

Table 6 presents other CXU configuration parameters.

Table 6. CXU-LI: other CXU configuration parameters

| Level | Parameter         | Range      | Default    | Description                                                         |
|-------|-------------------|------------|------------|---------------------------------------------------------------------|
|       | CXU_LI_VERSION    | 24'h010000 | 24'h010000 | CXU-LI version; 24'h01_00_00 == 1.00.00                             |
|       | CXU_N_CXUS        | 1+         | 1          | number of CXUs at/below this CXU                                    |
| 1+    | CXU_N_STATES      | 0+         | 0          | number of composable extension state contexts                       |
| 1     | CXU_LATENCY       | 0+         | 1          | latency (clock cycles) from a request to its response               |
| 1     | CXU_RESET_LATENCY | 0+         | 0          | min. latency (clock cycles) from negation of reset to first request |

CXU\_LI\_VERSION indicates the version of the CXU-LI signals and semantics in effect, using semantic versioning (semver.org), encoded as 24'hxx\_yy\_00: (major=xx, minor=yy, patch=00). Since CXU\_LI\_VERSION versions a specification and not an implementation, there is never a patch level. See also §1.6.



CXU\_LI\_VERSION anticipates subsequent evolution of CXU-LI.
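As an illustration of the 24'hxx\_yy\_00 encoding, the following sketch packs and unpacks a version triple. It assumes each field is an 8-bit plain binary number; the only value given in this section is 24'h010000 (1.00.00), which does not distinguish binary from BCD, so treat that choice as an assumption. The function names are hypothetical.

```python
def encode_cxu_li_version(major, minor, patch=0):
    """Pack (major, minor, patch) into a 24-bit value, 8 bits per field."""
    for name, value in (("major", major), ("minor", minor), ("patch", patch)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} does not fit in 8 bits")
    return (major << 16) | (minor << 8) | patch


def decode_cxu_li_version(value):
    """Unpack a 24-bit CXU_LI_VERSION into (major, minor, patch)."""
    return ((value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF)
```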

CXU\_N\_CXUS is the number of logical CXUs at/below this CXU. For a leaf CXU this may be more than one when the CXU implements multiple composable extensions (including multiple versions of one composable extension).

CXU\_N\_STATES is the number of composable extension state contexts for every stateful extension implemented by this CXU. It must be 0 if every composable extension implemented by the CXU is stateless. It must be 1+ if any composable extension implemented by the CXU is stateful. When a leaf CXU implements multiple stateful composable extensions, i.e. CXU\_N\_CXUS>1, each must be configured with the same number of state contexts.

CXU\_LATENCY and CXU\_RESET\_LATENCY are specific to CXU-L1 fixed latency CXUs. See §3.3.2.

### 3.4.2. Clock, reset, clock enable

CXU-L0 is combinational. Other feature levels' signaling is (mostly) synchronous to rising edge (posedge) of clk.

When the reset input signal rst is asserted on posedge clk, it supersedes all other CXU-LI signaling. Any request processing in progress is abandoned, all internal state is reset, and req\_ready and resp\_valid output signals, if present, are negated. A CXU-L1 CXU (which does not have a req\_ready output) must be ready to receive its first request after no more than its configured CXU\_RESET\_LATENCY clock cycles following negation of rst.

A clock enable input signal clk\_en facilitates clock gating of a CXU. When clk\_en is asserted on posedge clk, synchronous elements of the CXU (i.e., memories, registers, flip-flops) may change. When clk\_en is negated on posedge clk, no changes may occur to synchronous elements of the CXU. CXU operation is suspended. Therefore, when negating clk\_en, a CXU requester must disregard all CXU output signals, esp. req\_ready and resp\_valid.



In the twilight of Moore's Law, energy efficiency is a first order design concern, and it is a shame to burn power computing routinely discarded results.



All modern FPGAs enable simple clock gating via free clk\_en inputs on all LUT-cluster D flip-flops.



If a requester never clock gates a CXU with clk\_en, it should assert clk\_en with a constant '1. FPGA and ASIC implementation tools typically optimize away such signals and their D flip-flop clock enables.



Perhaps provide another configuration parameter CXU\_USE\_CLK\_EN=0/1 to configurably ignore clk\_en. This could simplify conversion of preexisting RTL function units, sans clk\_en gating, into new CXUs.

### 3.4.3. Request and response valid-ready flow control

CXU-L2 and -L3 provide CXU request and response channel synchronous valid-ready flow control. For each channel, the sender may assert data and a positive-true data valid signal indicating it is ready to send data. The receiver may assert a positive-true ready signal indicating it is ready to receive data. On posedge clk, if both valid and ready are asserted, data transfers from sender to receiver; otherwise, no transfer occurs during that clock cycle.

Once a sender asserts data and asserts data valid on posedge clk, it must assert the same data and valid on each subsequent posedge clk until the receiver asserts ready and the transfer occurs.

A valid output must not depend (via combinational logic) upon a ready input. However, a ready output may depend upon a valid input.

With request and response flow control, a requester must not indefinitely negate resp\_ready in response to a responder negating req\_ready.



This precludes a potential cyclical wait deadlock in a composed system.
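The valid-ready rules for a single channel can be illustrated with a cycle-by-cycle model. This is a behavioral sketch, not RTL: `run_channel` is a hypothetical name, each loop iteration stands for one posedge clk, and the sender is assumed to assert valid whenever it has data pending.

```python
def run_channel(items, ready_pattern):
    """Simulate one valid-ready channel.

    items:          data values the sender wants to transfer, in order.
    ready_pattern:  receiver's ready signal, one bool per posedge clk.
    Returns a list of (cycle, data) pairs, one per completed transfer.
    """
    pending = list(items)
    transfers = []
    for cycle, ready in enumerate(ready_pattern):
        # valid depends only on the sender's own state, never on ready.
        valid = bool(pending)
        if valid and ready:
            # Both asserted on this clock edge: the transfer occurs.
            transfers.append((cycle, pending.pop(0)))
        # Otherwise the sender keeps presenting the same data next cycle.
    return transfers
```

Because the head of `pending` is only removed on a completed transfer, the model holds data and valid stable until the receiver asserts ready, as the second rule above requires.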

### 3.4.4. Response status / error checking

At any feature level, in response to receiving a CXU request, the CXU error-checks the request data, performs the request, and outputs the first (i.e., lowest numbered) [2:0] resp\_status condition that applies:

Table 7. CXU response status values and conditions

| Name             | Value | Condition                                                                  |
|------------------|-------|----------------------------------------------------------------------------|
| CXU_OK           | 0     | no errors occurred processing request                                      |
| CXU_ERROR_CXU    | 1     | req_cxu is not a CXU_ID implemented by CXU                                 |
| CXU_ERROR_STATE  | 2     | req_state is not a valid STATE_ID for req_cxu                              |
| CXU_ERROR_OFF    | 3     | req_state is valid but this *serializable* state context is in the *off* state |
| CXU_ERROR_FUNC   | 4     | req_func is not a valid CF_ID for req_cxu                                  |
| CXU_ERROR_OP     | 5     | request operand(s) or state are a domain error for the custom function     |
| CXU_ERROR_CUSTOM | 6     | request causes a custom error (of a serializable composable extension)     |

When parameter CXU\_CXU\_ID\_W=0, req\_cxu is ignored: no CXU\_ERROR\_CXU errors.

When parameter CXU\_STATE\_ID\_W=0, req\_state is ignored: no CXU\_ERROR\_STATE errors.

STATE\_ID=0 is the only valid STATE\_ID for the CXU of a stateless composable extension.

CXU state may change if and only if the response status is one of CXU\_OK, CXU\_ERROR\_OP, or CXU\_ERROR\_CUSTOM.



When a response status is CXU\_ERROR\_CUSTOM, the CXU should update the specified state context's custom error status as a side effect of the request. Otherwise, a CX library may be surprised to observe that the custom error bit cx\_status.CU is set without observing a corresponding error bit upon retrieving (via cx\_read\_status) its state context's error state.

In response to receiving resp\_status of CXU\_ERROR\_CXU, CXU\_ERROR\_STATE, CXU\_ERROR\_OFF, or CXU\_ERROR\_FUNC, a CPU ignores resp\_data and uses zero as the result of the CF instruction.

When a CF instruction writes a destination register, (i.e., custom-0/-1 but not custom-2), the result of the CF instruction is written to the register, irrespective of the CXU response status.
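The status rules above reduce to three small predicates. The sketch below uses the names and values of Table 7; the helper function names are hypothetical, and the set of "conditions that apply" is assumed to be computed elsewhere.

```python
from enum import IntEnum


class Status(IntEnum):
    """Response status values, as listed in Table 7."""
    CXU_OK = 0
    CXU_ERROR_CXU = 1
    CXU_ERROR_STATE = 2
    CXU_ERROR_OFF = 3
    CXU_ERROR_FUNC = 4
    CXU_ERROR_OP = 5
    CXU_ERROR_CUSTOM = 6


# Statuses for which the CPU ignores resp_data and uses zero.
_ZEROED = {Status.CXU_ERROR_CXU, Status.CXU_ERROR_STATE,
           Status.CXU_ERROR_OFF, Status.CXU_ERROR_FUNC}

# Statuses under which CXU state is permitted to change.
_STATE_MAY_CHANGE = {Status.CXU_OK, Status.CXU_ERROR_OP,
                     Status.CXU_ERROR_CUSTOM}


def first_status(applicable_errors):
    """Pick the lowest numbered error that applies, else CXU_OK."""
    return min(applicable_errors, default=Status.CXU_OK)


def cf_result(status, resp_data):
    """Value the CPU uses as the result of the CF instruction."""
    return 0 if status in _ZEROED else resp_data


def state_may_change(status):
    """True if the CXU is allowed to have updated state for this response."""
    return status in _STATE_MAY_CHANGE
```

Note that `cf_result` always returns a value: the destination register write is never suppressed, only its data is forced to zero for the four ID and off-state errors.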



Can certain errors suppress destination register writes? No: data dependent writeback cancelation is irregular and unnecessarily complicates out of order CPUs.



Together these rules ensure { CXU, state, function } ID errors are well behaved at the hardware-software extension. By making the CPU responsible for zeroing such results, each CXU in a system's CXU DAG need not incur redundant logic and delay to respond resp\_data=0 on these three errors. For synchronously signaled CXU-LI levels, in an FPGA, with reset-able flip-flops, a registered resp\_data input may be zeroed for negligible cost.

### 3.4.5. Raw instruction

At CXU-LI feature level 2 or higher, CXU requests may be configured (CXU\_INSN\_W=32) to include the raw instruction word (req\_insn) of the CF instruction that issued the CXU request, if the request originates from a CF instruction, or all zeroes otherwise. A CXU may use the raw instruction data to help perform a custom function, or it may ignore the raw instruction entirely.



The raw instruction complements the CF\_ID (req\_func) identifier. CF\_ID is the preferred, future proof way to select a custom function. It is ISA neutral and abstracts the CPU away from CXU, and potentially reduces verification complexity.



However, access to the raw CF instruction word can enable additional use cases. As an example, consider a CXU with a private vector, matrix, or complex number register file. When this CXU receives a CXU request including its raw instruction word, it may opt to ignore either or both of the two integer request operands req\_data0 and req\_data1, and instead partially decode the raw instruction word to recover rs1 and rs2 fields, even rs3 if there are spare CF instruction bits, to determine which of its CXU register file entries to read. Similarly, the CXU can decode the raw instruction word to recover an rd field to determine which CXU-private register file entry to write back and whether to do so.
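A CXU that partially decodes req\_insn might recover register fields as sketched below. The field positions used here are those of the base RISC-V R-type format (rd in bits 11:7, rs1 in bits 19:15, rs2 in bits 24:20); that the CF instruction formats place their fields identically is an assumption of this sketch, and `decode_cf_fields` is a hypothetical name.

```python
def decode_cf_fields(req_insn):
    """Extract R-type style register fields from a 32-bit instruction word.

    Assumes base RISC-V R-type field positions; the CXU would use these
    indices to address its own private register file.
    """
    return {
        "opcode": req_insn & 0x7F,         # bits 6:0
        "rd": (req_insn >> 7) & 0x1F,      # bits 11:7
        "rs1": (req_insn >> 15) & 0x1F,    # bits 19:15
        "rs2": (req_insn >> 20) & 0x1F,    # bits 24:20
    }
```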



This feature is best used with the custom-2 flex instruction format which has no rd destination register field, freeing those bits for arbitrary uses.



Does raw instruction access merit security threat modeling? Imagine adversarial CXUs, snoopily watching the dynamic instruction stream go by, even when req\_valid is negated.



Half-baked idea (not recommended): Imagine a dynamic facility by which any arbitrary instruction word, not just custom-0/-1/-2 format instructions, may be a CF instruction, issued to a CXU. This might be a table of (mask,pattern) tuples, or a 32-bit mcx\_opcodes\_mask CSR bit vector of 5-bit major opcodes, identifying instructions to divert to the current CXU. Or perhaps, in the hardware domain, a CPU might first issue each instruction to the current CXU, and only execute the instruction in the CPU if the CXU delegates it back to the CPU.

### 3.4.6. Request-response ID

CXU-LI feature level 3 (reordering CXU) includes a request-response ID REQ\_ID, a CXU\_REQ\_ID\_W-bit signal used by requesters to correlate responses received with requests sent. With each request, the CXU receives the REQ\_ID as req\_id, and later, with each response, the CXU sends back the same REQ\_ID as resp\_id. For each request/response pair, the CXU must send the requester the identical request-response ID value that the requester previously sent to the CXU.

Operation and behavior of a CXU must not depend in any way upon any req\_id value received, except to receive it and later to return it to the requester.



An out-of-order completion CPU may send a REQ\_ID indicating the destination register of the request, and rely upon it when the response eventually returns.
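The requester side of REQ\_ID correlation can be sketched as a small table of in-flight requests. This is a hypothetical model (`ReorderingRequester` and its methods are invented for illustration) in which the requester records a destination register against each REQ\_ID; a CPU could equally use the destination register number itself as the REQ\_ID.

```python
class ReorderingRequester:
    """Hypothetical requester that matches responses to requests by REQ_ID."""

    def __init__(self):
        self.in_flight = {}    # req_id -> destination register number
        self.regs = {}         # destination register number -> value

    def send(self, req_id, rd):
        """Record an outstanding request tagged with req_id."""
        if req_id in self.in_flight:
            raise ValueError("REQ_ID already in flight")
        self.in_flight[req_id] = rd

    def receive(self, resp_id, resp_data):
        """Retire the request whose req_id equals the returned resp_id."""
        rd = self.in_flight.pop(resp_id)
        self.regs[rd] = resp_data
        return rd
```

With this bookkeeping, the pdiv/pmul example of §3.3.4 works regardless of arrival order: the pmul response may be received first and still lands in the register recorded for its REQ\_ID.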

### 3.5. CXU-L0 combinational CXU signaling

A combinational CXU, which implements a stateless composable extension, computes a combinational function of the CXU request, sending a CXU response after some propagation delay. There is no flow control.

### 3.5.1. CXU-L0 configuration parameters

Table 8. CXU-L0 configuration parameters

| Parameter      | Description                      |
|----------------|----------------------------------|
| CXU_LI_VERSION | CXU-LI version number            |
| CXU_N_CXUS     | number of CXUs at/below this CXU |

For CXU\_LI\_VERSION and CXU\_N\_CXUS, see §3.4.1.

### 3.5.2. CXU-L0 signals

Table 9. CXU-L0 signals

| Dir | Port        | Width Parameter | Description                               |
|-----|-------------|-----------------|-------------------------------------------|
| in  | req_valid   |                 | request valid                             |
| in  | req_cxu     | CXU_CXU_ID_W    | request CXU_ID: selects the requested CXU |
| in  | req_func    | CXU_FUNC_ID_W   | request CF_ID                             |
| in  | req_data0   | CXU_DATA_W      | request operand data 0                    |
| in  | req_data1   | CXU_DATA_W      | request operand data 1                    |
| out | resp_status | CXU_STATUS_W    | response status                           |
| out | resp_data   | CXU_DATA_W      | response data                             |

CXU-L0 signaling is asynchronous. CXU outputs are pure combinational functions of CXU inputs.



CXU-L0 has no resp\_valid signal because it would just reflect req\_valid.

### 3.5.3. CXU-L0 signaling protocol

### Protocol:

- 1. Request transfer
  - a. Requester asserts CXU request signals req\_\* and asserts req\_valid.
  - b. CXU asynchronously receives CXU request.
- 2. Response transfer
  - a. CXU performs steps 1, 2, 4, and 6 of response status / error checking per §3.4.4, and asserts resp\_status.
  - b. CXU asserts resp\_data, a combinational custom function of the operands.
  - c. Requester asynchronously receives CXU response.

As a CXU-L0 CXU is combinational, its delay folds into the path timing analysis of its requester.
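A combinational CXU behaves like a pure function of its request inputs, which the following sketch models for the population count example of §3.3.1. It is illustrative only: the function name, the choice of CF\_ID 0 for population count, and the single implemented CXU\_ID 0 are all hypothetical, and only the CXU\_ID and CF\_ID checks are modeled.

```python
CXU_OK = 0            # status values from Table 7
CXU_ERROR_CXU = 1
CXU_ERROR_FUNC = 4

POPCOUNT_CF_ID = 0    # hypothetical CF_ID assignment for this sketch


def l0_popcount_cxu(req_cxu, req_func, req_data0, req_data1, data_w=32):
    """Return (resp_status, resp_data) as a pure function of the request.

    req_data1 is unused by population count. On an error status the
    returned data is a don't-care; the CPU substitutes zero itself.
    """
    if req_cxu != 0:
        return CXU_ERROR_CXU, 0
    if req_func != POPCOUNT_CF_ID:
        return CXU_ERROR_FUNC, 0
    mask = (1 << data_w) - 1
    return CXU_OK, bin(req_data0 & mask).count("1")
```

There is no clock, state, or handshake in the model, mirroring the absence of clk, req\_state, and resp\_valid from the level 0 signal set.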

### 3.5.4. CXU-L0 example

Figure 16. Example CXU-L0 signaling protocol waveform

Figure 16 is an example waveform for three CXU-L0 requests and responses, arising from executing CF instructions f0(a0,b0), f1(a1,b1), and f2(a2,b2). All three instructions issue to the same CXU u0. Function f1 incurs an error.

### 3.6. CXU-L1 fixed latency CXU signaling

Each cycle, a fixed latency CXU computes a function of the CXU request and the specified state context, if any, updating the context, sending a CXU response after a configured fixed non-negative number of clock cycles. With an initiation interval of II=1/cycle, there is no flow control of requests or responses.

Lacking request flow control, if a CXU-L1 CXU is configured with multiple requesters, requesters must not send multiple simultaneous requests.

### 3.6.1. CXU-L1 configuration parameters

Table 10. CXU-L1 configuration parameters

| Parameter         | Description                                                            |
|-------------------|------------------------------------------------------------------------|
| CXU_LI_VERSION    | CXU-LI version number                                                  |
| CXU_N_CXUS        | number of CXUs at/below this CXU                                       |
| CXU_N_STATES      | number of composable extension state contexts                          |
| CXU_LATENCY       | latency (clock cycles) from a request to its response                  |
| CXU_RESET_LATENCY | minimum latency (clock cycles) from negation of reset to first request |

For CXU\_LI\_VERSION, CXU\_N\_CXUS, and CXU\_N\_STATES, see §3.4.1.

CXU\_LATENCY, specific to CXU-L1, configures the CXU latency, which is the number of clock cycles from receiving a request to sending a response, of every custom function implemented by the CXU. CXU\_LATENCY=0 configures the CXU to respond to the request in the same clock cycle.

A CXU-L1 CXU with CXU\_LATENCY=0 resembles a CXU-L0 combinational CXU, except it may implement a stateful composable extension.



Example: an extended precision arithmetic CXU which implements add\_save\_carry and add\_with\_carry\_save\_carry CF instructions. Like an ALU, this has zero cycle latency, but supports additional state context(s), each with a carry bit.

CXU\_RESET\_LATENCY, specific to CXU-L1, configures the CXU reset latency, which is the minimum number of clock cycles from negation of rst to first assertion of req\_valid. CXU\_RESET\_LATENCY=0 configures the CXU to be ready for a CXU request in the same cycle that rst is first negated.

### 3.6.2. CXU-L1 signals

Table 11. CXU-L1 signals

| Dir | Port        | Width Parameter | Description            |
|-----|-------------|-----------------|------------------------|
| in  | clk         |                 | clock                  |
| in  | rst         |                 | reset                  |
| in  | clk_en      |                 | clock enable           |
| in  | req_valid   |                 | request valid          |
| in  | req_cxu     | CXU_CXU_ID_W    | request CXU_ID         |
| in  | req_state   | CXU_STATE_ID_W  | request STATE_ID       |
| in  | req_func    | CXU_FUNC_ID_W   | request CF_ID          |
| in  | req_data0   | CXU_DATA_W      | request operand data 0 |
| in  | req_data1   | CXU_DATA_W      | request operand data 1 |
| out | resp_valid  |                 | response valid         |
| out | resp_status | CXU_STATUS_W    | response status        |
| out | resp_data   | CXU_DATA_W      | response data          |

#### 3.6.3. CXU-L1 signaling protocol

CXU-L1 is (mostly) synchronous to posedge clk when CXU\_LATENCY>0. See §3.4.2.

#### Protocol:

- 1. Request transfer.
  - a. Requester asserts CXU request signals req\_\* and asserts req\_valid.
  - b. CXU\_LATENCY=0: CXU receives CXU request asynchronously. CXU\_LATENCY>0: CXU receives CXU request on posedge clk.
- 2. Custom function execution.
  - a. CXU performs response status / error checking per §3.4.4.
  - b. CXU performs a custom function of the operands and the selected state context.
  - c. CXU may update the selected state context, logically prior to any updates from subsequent requests.
- 3. Response transfer.
  - a. CXU\_LATENCY=0:
    - i. CXU asserts CXU response signals resp\_valid, resp\_status, and resp\_data asynchronously.
    - ii. Requester receives CXU response asynchronously.
  - b. CXU\_LATENCY>0:
    - i. After (CXU\_LATENCY-1) cycles, CXU asserts resp\_valid, resp\_status, and resp\_data.
    - ii. Requester receives CXU response on posedge clk.

#### 3.6.4. CXU-L1 example



Figure 17. Example CXU-L1 signaling protocol waveform (CXU\_LATENCY=2, CXU\_RESET\_LATENCY=0)

Figure 17 is an example waveform for four CXU-L1 CXU requests and responses, arising from executing four CF instructions f0-f3. Since CXU\_RESET\_LATENCY=0, the CXU is ready for request f0 in cycle 1, the same cycle rst is negated. With CXU\_LATENCY=2, each response occurs 2 (enabled) clock cycles after each request is received. Each instruction issues a CXU request to the same CXU u0. Instructions f0 and f1 use state context s0; f2 and f3 use state context s2. Request f1 results in an error response. With clk\_en negated in cycles 6-19, the CXU is frozen until cycle 20, when it finally receives the f3 request. The f2 response, otherwise due in cycle 7, is also delayed, until cycle 21.
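
The clock-enable arithmetic in this example can be checked with a small calculation. The sketch below is a hypothetical Python helper (response\_cycle is not a CXU-LI name). It assumes a response appears CXU\_LATENCY enabled cycles after the cycle in which its request is received, and that cycles with clk\_en negated do not count. The f2 request cycle (5) is inferred from the statement that its response was otherwise due in cycle 7.

```python
def response_cycle(request_cycle, latency, frozen_cycles=frozenset()):
    """Cycle in which the response appears, counting only enabled cycles.

    frozen_cycles holds the cycle numbers in which clk_en is negated
    (assumed finite, so the loop terminates).
    """
    cycle = request_cycle
    remaining = latency
    while remaining > 0:
        cycle += 1
        if cycle not in frozen_cycles:   # frozen cycles make no progress
            remaining -= 1
    return cycle


frozen = set(range(6, 20))                 # clk_en negated in cycles 6-19
f2_response = response_cycle(5, 2, frozen) # delayed from cycle 7 to cycle 21
```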

## 3.7. CXU-L2 variable latency CXU signaling

A variable latency CXU computes a function of a CXU request and the specified state context, if any, updating the context and sending a CXU response, in order, in a later clock cycle. There is **request and response flow control** so the CXU can suspend receiving requests and the requester can suspend receiving responses.



When the requester is a CPU, use of CXU-L2 allows the CPU to delay receipt of a CXU response. This affords the CPU pipeline greater flexibility to dynamically prioritize other units' accesses to register file write port(s). Conversely, CXU-L2 can complicate design of a CXU, which may have to respond to negated resp\_ready by buffering the response in an output FIFO or by applying back pressure through its processing pipeline, or negate req\_ready to delay receipt of new requests.

#### 3.7.1. CXU-L2 configuration parameters

Table 12. CXU-L2 configuration parameters

| Parameter      | Description                                   |
|----------------|-----------------------------------------------|
| CXU_LI_VERSION | CXU-LI version number                         |
| CXU_N_CXUS     | number of CXUs at/below this CXU              |
| CXU_N_STATES   | number of composable extension state contexts |

For CXU\_LI\_VERSION, CXU\_N\_CXUS, and CXU\_N\_STATES, see §3.4.1.

#### 3.7.2. CXU-L2 signals

Table 13. CXU-L2 signals

| Dir | Port        | Width Parameter | Description             |
|-----|-------------|-----------------|-------------------------|
| in  | clk         |                 | clock                   |
| in  | rst         |                 | reset                   |
| in  | clk_en      |                 | clock enable            |
| in  | req_valid   |                 | request valid           |
| out | req_ready   |                 | request ready           |
| in  | req_cxu     | CXU_CXU_ID_W    | request CXU_ID          |
| in  | req_state   | CXU_STATE_ID_W  | request STATE_ID        |
| in  | req_func    | CXU_FUNC_ID_W   | request CF_ID           |
| in  | req_insn    | CXU_INSN_W      | request raw instruction |
| in  | req_data0   | CXU_DATA_W      | request operand data 0  |
| in  | req_data1   | CXU_DATA_W      | request operand data 1  |
| out | resp_valid  |                 | response valid          |
| in  | resp_ready  |                 | response ready          |
| out | resp_status | CXU_STATUS_W    | response status         |
| out | resp_data   | CXU_DATA_W      | response data           |

#### 3.7.3. CXU-L2 signaling protocol

CXU-L2 is synchronous to posedge clk. See §3.4.2. CXU-L2 includes the request's raw instruction. See §3.4.5.

#### Protocol:

- 1. Request transfer.
  - a. Requester asserts CXU request signals req\_\* and asserts req\_valid.
  - b. Responder may assert req\_ready.
  - c. CXU receives CXU request on posedge clk when req\_valid and req\_ready are both asserted, per §3.4.3.
- 2. Custom function execution.
  - a. CXU performs response status / error checking per §3.4.4.
  - b. CXU performs a custom function of the operands and the selected state context.
  - c. CXU may update the selected state context, logically prior to any updates from subsequent requests.
- 3. Response transfer.
  - a. Prior to issuing responses from subsequent requests (i.e., in order of requests) CXU asserts resp\_status and resp\_data and asserts resp\_valid.
  - b. Requester may assert resp\_ready.
  - c. Requester receives CXU response on posedge clk when resp\_valid and resp\_ready are both asserted, per §3.4.3.
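
The valid/ready rule used in steps 1c and 3c can be illustrated with a short Python sketch. The function name and the sample traces are hypothetical; the only rule taken from the protocol is that a transfer occurs in a cycle where both valid and ready are asserted.

```python
def transfer_cycles(valid, ready):
    """Return the cycle indices in which a transfer occurs.

    valid[i] and ready[i] are the signal levels sampled at posedge clk
    of cycle i; a transfer needs both asserted in the same cycle.
    """
    return [i for i, (v, r) in enumerate(zip(valid, ready)) if v and r]


# Hypothetical trace: the requester holds req_valid from cycle 3,
# but the CXU only asserts req_ready in cycle 4.
req_valid = [0, 0, 0, 1, 1, 0]
req_ready = [1, 1, 1, 0, 1, 1]
received = transfer_cycles(req_valid, req_ready)   # the request is received in cycle 4
```

The same function applies unchanged to resp\_valid and resp\_ready on the response channel.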

#### 3.7.4. CXU-L2 example

Figure 18 is an example waveform for four CXU-L2 CXU requests and responses, arising from executing four CF instructions f0-f3. (Assume CXU\_INSN\_W=0, no req\_insn.) Each instruction issues a CXU request to the same CXU u0. Instructions f0 and f1 use state context s0; f2 and f3 use state context s2.



Figure 18. Example CXU-L2 signaling protocol waveform

The CXU receives request f0 in cycle 2 and responds in cycle 3.

Requester asserts request f1 in cycle 3, but it is not received by the CXU until it asserts req\_ready in cycle 4. The CXU sends the f1 response in cycle 6, an error response, a latency of 2 cycles. Requester asserts resp\_ready and receives the response in cycle 7.

Requester asserts request f2 in cycle 6, but it is not received by the CXU until it asserts req\_ready in cycle 7. The CXU responds to f2 in cycle 21, a latency of 14 cycles.

Requester asserts request f3 in cycle 21, and the CXU responds in cycle 22.

## 3.8. CXU-L3 reordering CXU signaling

A reordering CXU computes a function of the CXU request and the specified state context, if any, updating the context, and sending a CXU response in a later clock cycle. Responses for requests with the same context are sent in order, otherwise may be sent out of order. There is request and response flow control.

CXU-L3 incorporates a request-response ID for the requester to correlate responses received to requests sent.


This CXU-LI feature level is motivated by past experience building floating point CXUs. Different functions, e.g., comparison, conversion, multiplication, addition, division, and square root, exhibit a wide range of latencies. Some functions, e.g. addition and multiplication, may be pipelined and afford an initiation interval II=1/cycle, while others, e.g. division and square root, may be variable latency and perform one request at a time.

Particularly when a composable extension is stateless and when the requester (e.g., an in-order-issue/out-of-order completion CPU) tolerates out of order responses, response reordering can improve performance and simplify CXU logic by reducing average CXU latency, enabling greater CXU parallelism, and reducing request blocking and response queueing.


When a composable extension is stateful, response reordering cannot occur for any sequence of requests with the same state context, to ensure identical response data and program behavior over time and over different CXU implementations of the same composable extension.

#### 3.8.1. CXU-L3 configuration parameters

Table 14. CXU-L3 configuration parameters

| Parameter      | Description                                   |
|----------------|-----------------------------------------------|
| CXU_LI_VERSION | CXU-LI version number                         |
| CXU_N_CXUS     | number of CXUs at/below this CXU              |
| CXU_N_STATES   | number of composable extension state contexts |

For CXU\_LI\_VERSION, CXU\_N\_CXUS, and CXU\_N\_STATES, see §3.4.1.

#### 3.8.2. CXU-L3 signals

Table 15. CXU-L3 signals

| Dir | Port        | Width Parameter | Description             |
|-----|-------------|-----------------|-------------------------|
| in  | clk         |                 | clock                   |
| in  | rst         |                 | reset                   |
| in  | clk_en      |                 | clock enable            |
| in  | req_valid   |                 | request valid           |
| out | req_ready   |                 | request ready           |
| in  | req_id      | CXU_REQ_ID_W    | request REQ_ID          |
| in  | req_cxu     | CXU_CXU_ID_W    | request CXU_ID          |
| in  | req_state   | CXU_STATE_ID_W  | request STATE_ID        |
| in  | req_func    | CXU_FUNC_ID_W   | request CF_ID           |
| in  | req_insn    | CXU_INSN_W      | request raw instruction |
| in  | req_data0   | CXU_DATA_W      | request operand data 0  |
| in  | req_data1   | CXU_DATA_W      | request operand data 1  |
| out | resp_valid  |                 | response valid          |
| in  | resp_ready  |                 | response ready          |
| out | resp_id     | CXU_REQ_ID_W    | response ID             |
| out | resp_status | CXU_STATUS_W    | response status         |
| out | resp_data   | CXU_DATA_W      | response data           |

#### 3.8.3. CXU-L3 signaling protocol

CXU-L3 is synchronous to posedge clk. See §3.4.2. CXU-L3 includes a request-response ID. See §3.4.6. CXU-L3 includes the request's raw instruction. See §3.4.5.

#### Protocol:

- 1. Request transfer.
  - a. Requester asserts CXU request signals req\_\* (including new CXU-L3 signal req\_id) and asserts req\_valid.
  - b. Responder may assert req\_ready.
  - c. CXU receives CXU request on posedge clk when req\_valid and req\_ready are both asserted, per §3.4.3.
- 2. Custom function execution.
  - a. CXU performs response status / error checking per §3.4.4.
  - b. CXU performs a custom function of the operands and the selected state context.
  - c. CXU may update the selected state context, logically prior to any updates *to the same state context* from subsequent requests.
- 3. Response transfer.
  - a. Prior to issuing responses from subsequent requests to the same state context (i.e., in order of requests to the same state context) CXU asserts resp\_id, resp\_status, resp\_data and asserts resp\_valid.
  - b. Requester may assert resp\_ready.
  - c. Requester receives CXU response on posedge clk when resp\_valid and resp\_ready are both asserted, per §3.4.3.

#### 3.8.4. CXU-L3 example

Figure 19 is an example waveform for four CXU-L3 CXU requests, illustrating two different valid out-of-order response sequences, arising from executing four CF instructions f0-f3. (Assume CXU\_INSN\_W=0, no req\_insn.)

Each instruction issues a CXU request to the same CXU u0, but with various state contexts s0, s1, s0 (again), and s3. This constrains the CXU to respond to request f0 with state s0, before responding to subsequent request f2 for state s0.

Note that each CXU request is tagged with a req\_id, a value that is returned by the CXU with the corresponding resp\_id, and used by the requester to correlate responses to requests and recover the reordering as necessary.



Figure 19. Example CXU-L3 signaling protocol waveform, with two of the possible response orderings

In the first example response, with signals labeled *Response*, the CXU receives requests (f0, f1, f2, f3) but responds in order (f1, f3, f0, f2). In the second example response, with signals labeled *Another Ordering*, the CXU responds in order (f3, f0, f2, f1). Both orderings are valid because they preserve the order f0 < f2, which is required because these two CXU requests use the same state context s0.
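
The ordering constraint can be stated as a small check. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes every request names a state context, so that two requests are ordered exactly when they share a STATE\_ID.

```python
def is_valid_response_order(requests, response_order):
    """Check a CXU-L3 response order against the per-context ordering rule.

    requests: list of (req_id, state_id) pairs in request order.
    response_order: list of req_ids in the order responses were sent.
    """
    if sorted(response_order) != sorted(req_id for req_id, _ in requests):
        return False                      # each request needs exactly one response
    position = {req_id: i for i, req_id in enumerate(response_order)}
    last_seen = {}                        # state_id -> position of latest response
    for req_id, state_id in requests:
        if state_id in last_seen and position[req_id] < last_seen[state_id]:
            return False                  # a later request overtook an earlier one
        last_seen[state_id] = position[req_id]
    return True


# The four requests described above: f0 and f2 share state context s0.
reqs = [("f0", 0), ("f1", 1), ("f2", 0), ("f3", 3)]
ok_first = is_valid_response_order(reqs, ["f1", "f3", "f0", "f2"])
ok_second = is_valid_response_order(reqs, ["f3", "f0", "f2", "f1"])
bad = is_valid_response_order(reqs, ["f2", "f0", "f1", "f3"])   # f2 before f0
```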

## 3.9. CXU feature level adapters

A CXU feature level adapter is an intermediary CXU that receives requests and sends responses at one CXU-LI feature level and adapts them for and forwards them to a subordinate CXU at a lower CXU-LI feature level.

CXU-LI includes a set of configurable adapters to raise any CXU to any higher feature level, easing composition:

- Cvt01: raise L0 to L1: add configurable latency pipelining
- Cvt02, Cvt12: raise L0 or L1 to L2: add request-response flow control (may suspend requests)



TODO: Describe the L3 adapters, which are just L2 adapters with a request-response ID FIFO.

#### 3.9.1. Cvt01: raise CXU-L0 to CXU-L1

A Cvt01 adapter CXU implements CXU-L1, including its configuration parameters (§3.6.1), adapting L1 requests to and responses from a subordinate combinational L0 CXU.

When CXU\_LATENCY=0, the adapter's request/response channels are directly coupled to the subordinate CXU request/response channels. Otherwise, these channels' I/Os are registered and pipelined, with a total latency of CXU\_LATENCY cycles.



Automatic pipeline retiming may slice the combinational logic cone into several pipeline stages, achieving higher frequency operation.

#### 3.9.2. Cvt02: raise CXU-L0 to CXU-L2

A Cvt02 adapter CXU implements CXU-L2, including its configuration parameters (§3.7.1), adapting L2 requests to and responses from a subordinate combinational L0 CXU. The adapter has a fixed latency of one cycle — a response is sent one cycle after a request is received.



To avoid arbitrary CXU response queuing, yet keep signaling simple and frugal, the Cvt02 adapter might negate req\_ready on any cycle that it has a valid response waiting (asserting resp\_valid) and the requester negates resp\_ready.
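
A minimal behavioural sketch of this flow control policy follows, in Python. It is one possible reading of the paragraph above, not the reference adapter: the class name, the single response register, and the step interface are all assumptions, and the subordinate CXU-L0 CXU is stood in for by an ordinary function.

```python
class Cvt02Model:
    """One-entry Cvt02 sketch: registers the result of a combinational function."""

    def __init__(self, func):
        self.func = func            # stands in for the subordinate CXU-L0 CXU
        self.resp_valid = False     # response register occupancy
        self.resp_data = None

    def step(self, req_valid, req, resp_ready):
        """Advance one clock; return (req_ready, resp_valid, resp_data) for this cycle."""
        resp_valid = self.resp_valid
        resp_data = self.resp_data if resp_valid else None
        # Negate req_ready only while a response waits and cannot be delivered.
        req_ready = not (resp_valid and not resp_ready)
        if resp_valid and resp_ready:
            self.resp_valid = False             # response transfers to requester
        if req_valid and req_ready:
            self.resp_data = self.func(*req)    # registered: visible next cycle
            self.resp_valid = True
        return req_ready, resp_valid, resp_data


adapter = Cvt02Model(lambda a, b: a + b)
trace = [
    adapter.step(True, (1, 2), True),    # request accepted, no response yet
    adapter.step(True, (3, 4), False),   # response stalled, so req_ready is negated
    adapter.step(True, (3, 4), True),    # response delivered, next request accepted
    adapter.step(False, None, True),     # second response delivered
]
```

With a single register, the adapter never holds more than one undelivered response, which is why no deeper queue is needed in this sketch.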

#### 3.9.3. Cvt12: raise CXU-L1 to CXU-L2

A Cvt12 adapter CXU implements CXU-L2, including its configuration parameters (§3.7.1), plus CXU\_LATENCY (§3.6.1), adapting L2 requests to and responses from a subordinate fixed latency L1 CXU.

The CXU\_LATENCY parameter, which specifies the latency of the *subordinate L1 CXU*, typically configures the depth of a response FIFO: an entire response stream must be buffered when the requester, having just issued CXU\_LATENCY requests to the L1 CXU, negates resp\_ready for as many clock cycles. Eventually, with response transfers paused, the response FIFO fills and the adapter CXU negates req\_ready.

When CXU\_LATENCY=0, the subordinate CXU response must be registered and therefore the adapter's response latency is at least one cycle.

## 3.10. CXU-LI-compliant CPUs

A CXU-LI-compliant CPU implements the RISC-V RV-I -Zicsr -Zicx instruction set, sends CXU requests upon issuing CF instructions, and writes a destination register and CXU status CSR in response to CXU responses.

#### 3.10.1. CPUs and CXU-LI feature levels

CPUs, as CXU requesters, use specific CXU-LI feature levels.

An austere single-cycle CPU might use CXU-L0 with a combinational CXU (only).

A pipelined in-order CPU might use CXU-L1 with a fixed latency CXU configured for (e.g.) 2 cycles latency. It might also use CXU-L2 with a variable latency CXU, stalling the pipeline during cycles where CF instructions cannot issue because the selected CXU negates req\_ready, and itself negating resp\_ready during write-back cycles when the register file's write port or other necessary resource is unavailable.


An out-of-order completion CPU, i.e. one that may commit low latency instructions before prior high latency instructions, might issue CF instructions to a CXU-L2 variable latency CXU and in some future cycle retire the variable latency CXU response, here again negating resp\_ready when it is unable to accept a response to writeback.

An OoO completion CPU, that handles reordered CXU responses, might use a CXU-L3 reordering CXU.

A CPU may have one or more sets of CXU request and response ports. For each such set, a CPU may send zero or one CXU request per cycle and receive zero or one CXU response per cycle.



Most CPUs send up to one request and receive up to one response. However, a CXU-LI compliant superscalar CPU might send multiple CXU requests and receive multiple CXU responses, to multiple CXUs of the same, or different, CXU-LI feature levels, in parallel, in the same cycle.

## 3.11. Example: CXU signaling in a composed system

Consider Figure 20, a system composed from two single-hart CPUs, two stateful CXUs, and a 2-input, 2-output Switch CXU. Fixed latency CXU<sub>0</sub> implements CXU-L1, configured with CXU\_LATENCY=1. The CPUs, CXU<sub>1</sub>, and Switch22 use/implement CXU-L2. Cvt12, a CXU level converter, up-converts CXU<sub>0</sub> from CXU-L1 to CXU-L2.



Figure 20. CXU-L2 system, with two CPUs, switch CXU, converter CXU, CXU<sub>0</sub> (L1), and CXU<sub>1</sub> (L2)

With one hart per CPU, the composable extensions' CXUs are configured with two state contexts each (<2>).

Both CPU<sub>0</sub> and CPU<sub>1</sub> are configured to issue CF instructions mapping CX\_ID<sub>0</sub> → CXU\_ID=0 → CXU<sub>0</sub> and CX\_ID<sub>1</sub> → CXU\_ID=1 → CXU<sub>1</sub>.

The exemplary 2x2 Switch CXU is frugal, if low frequency, while sustaining one cycle initiation interval transfers of requests and responses. It multiplexes downstream request transfers and upstream response transfers. In both directions, the switch consists of input ports (not registered), output port registers, an approximately fair output port arbiter, and a 2x2 channel crossbar. Each cycle, the switch determines which output ports are *available* (i.e., are empty, or will transfer (valid & ready) this cycle) and which valid inputs are *eligible* to transfer, then asserts ready, and transfers, some eligible inputs to available output ports, based upon a rotating priority order.

A request input port is eligible to transfer if it is valid and if the target req\_cxu CXU\_ID is the same as the last request, or if there are no pending responses for this port. This ensures that responses for requests, routed to different CXUs with different latencies, are always returned in order to the requester, as required by CXU-L2.

Downstream request routing is per the request inputs' req\_cxu elements: CXU\_ID=0 routes to the first output port and CXU\_ID=1 routes to the second output port. The switch itself responds to requests with invalid CXU\_IDs with a CXU\_ERROR\_CXU response.

For upstream response routing, the Switch incorporates, for each subordinate CXU, a FIFO queue that records the requester port ID that issued each request to that CXU. As each (in order) response from that CXU is received, the requester port ID is dequeued from that FIFO and used to route the response to its corresponding requester.
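
The eligibility rule and the per-CXU return FIFOs can be sketched together in Python. This is a simplified bookkeeping model under stated assumptions, not the exemplary Switch itself: the class and method names are hypothetical, arbitration and the CXU\_ERROR\_CXU path are omitted, and a response is counted as no longer pending as soon as it is routed.

```python
from collections import deque


class SwitchBookkeeping:
    """Tracks request eligibility and upstream response routing for a switch."""

    def __init__(self, n_ports, n_cxus):
        self.last_cxu = [None] * n_ports     # CXU_ID of each port's latest request
        self.pending = [0] * n_ports         # responses still owed to each port
        self.return_fifo = [deque() for _ in range(n_cxus)]  # requester port IDs

    def eligible(self, port, cxu_id):
        # Same target as the last request, or nothing outstanding for this port.
        return cxu_id == self.last_cxu[port] or self.pending[port] == 0

    def accept_request(self, port, cxu_id):
        if not self.eligible(port, cxu_id):
            raise ValueError("request is not eligible to transfer")
        self.last_cxu[port] = cxu_id
        self.pending[port] += 1
        self.return_fifo[cxu_id].append(port)   # remember who asked, in order

    def route_response(self, cxu_id):
        port = self.return_fifo[cxu_id].popleft()  # responses arrive in order
        self.pending[port] -= 1
        return port
```

Holding back a request to a different CXU until the port has nothing outstanding is what keeps responses in request order for each requester, even when the two CXUs have different latencies.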

In this example, assume each CPU decouples issue and commit using a scoreboarded register file enabling arbitrary extension unit latencies. Each CPU runs the same code (Listing 1):

- 1. Write mcx\_selector for CXU\_ID=0 and STATE\_ID=HART\_ID, issue two CF instructions to CXU<sub>0</sub>;
- 2. Write mcx\_selector for CXU\_ID=1 and STATE\_ID=HART\_ID, issue two CF instructions to CXU<sub>1</sub>;
- 3. Write mcx\_selector for CXU\_ID=0 and STATE\_ID=HART\_ID, issue one CF instruction to CXU<sub>0</sub>.

Listing 1. Issue stateful CF instructions f0 and f1 to CXU<sub>0</sub>, f2 and f3 to CXU<sub>1</sub>, and f4 to CXU<sub>0</sub> again.

```
csrw mcx_selector,x20 ; select CXU_ID=0 and STATE_ID=HART_ID
cx_reg 0,x3,x1,x2 ; u0.f0
cx_reg 1,x6,x5,x4 ; u0.f1

csrw mcx_selector,x21 ; select CXU_ID=1 and STATE_ID=HART_ID
cx_reg 2,x9,x7,x8 ; u1.f2
cx_reg 3,x12,x11,x10 ; u1.f3

csrw mcx_selector,x20 ; select CXU_ID=0 and STATE_ID=HART_ID again
cx_reg 4,x15,x13,x14 ; u0.f4
```

Figure 21 is an example waveform executing Listing 1 near-simultaneously on the two CPUs of Figure 20.

```
(1:u2<3>.f4 denotes CXU request #1 with CXU ID=2 STATE ID=3 CF ID=4)
```
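
For readers who want to post-process the narrative or a simulation log, the label notation above can be decoded mechanically. The Python sketch below is a hypothetical helper; the dictionary keys are names chosen here, not CXU-LI identifiers.

```python
import re

# request number, CXU_ID, STATE_ID, CF_ID, as in "1:u2<3>.f4"
_LABEL_RE = re.compile(r"^(\d+):\s*u(\d+)<(\d+)>\.\s*f(\d+)$")


def parse_request_label(text):
    """Decode a label such as '1:u2<3>.f4' into its four numeric fields."""
    match = _LABEL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a request label: {text!r}")
    req, cxu_id, state_id, cf_id = (int(group) for group in match.groups())
    return {"req": req, "cxu_id": cxu_id, "state_id": state_id, "cf_id": cf_id}


example = parse_request_label("1:u2<3>.f4")
```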

In the narrative that follows, "A sends B" means A asserts B ahead of the next posedge clk, whereas "B transfers to C" means C receives and accepts B during this cycle. Recall that with CXU-L2, request transfers occur when both req\_valid and req\_ready are asserted (§3.4.3), and response transfers occur when both resp\_valid and resp\_ready are asserted.



Figure 21. Example 2-input 2-output CXU-L2 Switch CXU signaling protocol waveform

#### Cycle-by-cycle:

- 0. Both CPUs CSR-write their hart's mcx\_selector registers, selecting CXU\_ID=0=CXU<sub>0</sub>, and their hart's STATE\_ID.
  Both CPUs issue the first CF instruction (f0).
  CPU<sub>0</sub> sends first CXU request (request #0): CXU\_ID=0 STATE\_ID=0 CF\_ID=0, a.k.a. 0:u0<0>.f0.
  CPU<sub>1</sub> sends first CXU request (request #5): CXU\_ID=0 STATE\_ID=1 CF\_ID=0, a.k.a. 5:u0<1>.f0.
- 1. CPU<sub>0</sub>'s first request, destined for CXU<sub>0</sub>, wins arbitration for Switch output port 0. Switch asserts CPU<sub>0</sub>'s req\_ready and negates CPU<sub>1</sub>'s req\_ready.
  CPU<sub>0</sub>'s first request 0:u0<0>.f0 transfers to Switch.
  Switch sends CPU<sub>0</sub>'s first request to Cvt12(CXU<sub>0</sub>).
  CPU<sub>0</sub> sends second CXU request: 1:u0<0>.f1.
- 2. CPU<sub>1</sub>'s first request, destined for CXU<sub>0</sub>, wins arbitration for Switch output port 0. Switch asserts CPU<sub>1</sub>'s req\_ready and negates CPU<sub>0</sub>'s req\_ready.
  CPU<sub>1</sub>'s first request 5:u0<1>.f0 transfers to Switch.
  Switch sends CPU<sub>1</sub>'s first request to Cvt12(CXU<sub>0</sub>).
  CPU<sub>1</sub> sends second CXU request: 6:u0<1>.f1.
  CPU<sub>0</sub>'s first request 0:u0<0>.f0 transfers to CXU<sub>0</sub>. CXU<sub>0</sub> executes 0:f0, updates state <0>, sends response to Switch.
- 3. CPU<sub>0</sub> sends no CXU request this cycle, due to its second csrw execution cycle.
  CPU<sub>0</sub>'s second request 1:u0<0>.f1 wins arbitration, transfers to Switch, is sent to Cvt12(CXU<sub>0</sub>).
  CPU<sub>1</sub>'s first request 5:u0<1>.f0 transfers to CXU<sub>0</sub>, executes, updates <1>, sends response to Switch.
  CXU<sub>0</sub>'s response to CPU<sub>0</sub>'s first request transfers to Switch, is sent to CPU<sub>0</sub>.
- 4. CPU<sub>1</sub> sends no CXU request this cycle, due to its second csrw execution cycle.
  CPU<sub>1</sub>'s second request 6:u0<1>.f1 wins arbitration, transfers to Switch, is sent to Cvt12(CXU<sub>0</sub>).
  CPU<sub>0</sub>'s second request 1:u0<0>.f1 transfers to CXU<sub>0</sub>, executes, updates <0>, sends response to Switch.
  CXU<sub>0</sub>'s response to CPU<sub>1</sub>'s first request transfers to Switch, is sent to CPU<sub>1</sub>.
  CXU<sub>0</sub>'s response to CPU<sub>0</sub>'s first request transfers to CPU<sub>0</sub>.
- 5. CPU<sub>0</sub> bubble in CXU request issue due to its second csrw execution cycle.
  CPU<sub>1</sub> sends third request 7:u1<1>.f2, with CXU\_ID=1, destined for CXU<sub>1</sub>.
  CPU<sub>0</sub>'s third request 2:u1<0>.f2 transfers to Switch, is sent to CXU<sub>1</sub>.
  CPU<sub>0</sub> sends fourth request 3:u1<0>.f3, with CXU\_ID=1, destined for CXU<sub>1</sub>.
  CPU<sub>1</sub>'s second request 6:u0<1>.f1 transfers to CXU<sub>0</sub>, executes, updates <1>, sends response to Switch.
  CXU<sub>0</sub>'s response to CPU<sub>0</sub>'s second request transfers to Switch, is sent to CPU<sub>0</sub>.
  CXU<sub>0</sub>'s response to CPU<sub>1</sub>'s first request transfers to CPU<sub>1</sub>.
- 6. CPU<sub>1</sub>'s third request 7:u1<1>.f2 wins arbitration, transfers to Switch, is sent to CXU<sub>1</sub>.
  CPU<sub>1</sub> sends fourth request 8:u1<1>.f3, with CXU\_ID=1, destined for CXU<sub>1</sub>.
  CPU<sub>0</sub>'s third request 2:u1<0>.f2 transfers to CXU<sub>1</sub>, executes, updates <0>, sends response to Switch.
  CXU<sub>0</sub>'s response to CPU<sub>1</sub>'s second request transfers to Switch, is sent to CPU<sub>1</sub>.
  CXU<sub>0</sub>'s response to CPU<sub>0</sub>'s second request transfers to CPU<sub>0</sub>.
- 7. CPU<sub>0</sub> sends no CXU request this cycle, due to its third csrw execution cycle.
  CPU<sub>0</sub>'s fourth request 3:u1<0>.f3 wins arbitration, transfers to Switch, is sent to CXU<sub>1</sub>.
  CPU<sub>1</sub>'s third request 7:u1<1>.f2 transfers to CXU<sub>1</sub>, begins execution.
  CXU<sub>1</sub>'s response to CPU<sub>0</sub>'s third request transfers to Switch, is sent to CPU<sub>0</sub>.
  CXU<sub>0</sub>'s response to CPU<sub>1</sub>'s second request transfers to CPU<sub>1</sub>.
- 8. CPU<sub>1</sub> sends no CXU request this cycle, due to its third csrw execution cycle.
  CPU<sub>0</sub> sends fifth request 4:u0<0>.f4, with CXU\_ID=0, destined for CXU<sub>0</sub>.
  At CXU<sub>1</sub>, CPU<sub>1</sub>'s third request 7:u1<1>.f2 completes execution, updates <1>, sends response to Switch.
  CXU<sub>1</sub>'s response to CPU<sub>0</sub>'s third request transfers to CPU<sub>0</sub>.
- 9. CPU<sub>0</sub>'s fifth CXU request is ineligible to transfer because CPU<sub>0</sub> has pending requests to CXU<sub>1</sub>. It becomes eligible at cycle 13.
  CPU<sub>1</sub>'s fourth request 8:u1<1>.f3 transfers to Switch, is sent to CXU<sub>1</sub>.
  CPU<sub>0</sub>'s fourth request 3:u1<0>.f3 transfers to CXU<sub>1</sub>, begins execution.
  CXU<sub>1</sub>'s response to CPU<sub>1</sub>'s third request transfers to Switch, is sent to CPU<sub>1</sub>.
- 10. CPU<sub>1</sub> sends fifth request 9:u0<1>.f4, with CXU\_ID=0, destined for CXU<sub>0</sub>.
  CPU<sub>0</sub>'s fourth CXU request 3:u1<0>.f3 continues execution.
  CXU<sub>1</sub>'s response to CPU<sub>1</sub>'s third request transfers to CPU<sub>1</sub>.
- 11. CPU<sub>1</sub>'s fifth CXU request is ineligible to transfer because CPU<sub>1</sub> has pending requests to CXU<sub>1</sub>. It becomes eligible at cycle 14.
  CPU<sub>0</sub>'s fourth CXU request 3:u1<0>.f3 completes execution, updates <0>, sends response to Switch.
- 12. CPU<sub>1</sub>'s fourth request 8:u1<1>.f3 transfers to CXU<sub>1</sub>, executes, updates <1>, sends response to Switch.
  CXU<sub>1</sub>'s response to CPU<sub>0</sub>'s fourth request transfers to Switch, is sent to CPU<sub>0</sub>.
- 13. CXU<sub>1</sub>'s response to CPU<sub>0</sub>'s fourth request transfers to CPU<sub>0</sub>.
  CPU<sub>0</sub>'s fifth request 4:u0<0>.f4 becomes eligible, transfers to Switch, is sent to CXU<sub>0</sub>.
- 14. CXU<sub>1</sub>'s response to CPU<sub>1</sub>'s fourth request transfers to CPU<sub>1</sub>.
  CPU<sub>1</sub>'s fifth request 9:u0<1>.f4 becomes eligible, transfers to Switch, is sent to CXU<sub>0</sub>.
  CPU<sub>0</sub>'s fifth request 4:u0<0>.f4 transfers to CXU<sub>0</sub>, executes, updates <0>, sends response to Switch.
- 15. CPU<sub>1</sub>'s fifth request 9:u0<1>.f4 transfers to CXU<sub>0</sub>, executes, updates <1>, sends response to Switch.
  CXU<sub>0</sub>'s response to CPU<sub>0</sub>'s fifth request transfers to Switch, is sent to CPU<sub>0</sub>.
- 16. CXU<sub>0</sub>'s response to CPU<sub>1</sub>'s fifth request transfers to Switch, is sent to CPU<sub>1</sub>.
  CXU<sub>0</sub>'s response to CPU<sub>0</sub>'s fifth request transfers to CPU<sub>0</sub>.
- 17. CXU<sub>0</sub>'s response to CPU<sub>1</sub>'s fifth request transfers to CPU<sub>1</sub>.

## 3.12. Composing CXUs with AXI4-Streams

In some configured systems, preexisting infrastructure components that implement AXI4-Stream protocol may be used to help compose CPUs and CXUs. A fully flow controlled CXU-LI -L2 or -L3 transfer may be transported over two AXI4-Stream (AXI-S) streams, one for requests and one for responses.



For example, in an AMD/Xilinx Versal FPGA, a CPU might transfer CXU requests, via CXU-L2-to-AXI-S bridge, AXI-S-to-NOC bridge, Versal NOC, NOC-to-AXI-S bridge, AXI-S-to-CXU-L2 bridge, to a CXU at the far corner of the FPGA fabric, later transferring CXU responses back to the distant CPU by the same means.

Table 16 presents a recommended canonical mapping between CXU-LI signals and the two AXI-S streams.

Table 16. Recommended mapping between CXU-L2/-L3 and request/response AXI4-Streams

| Dir | CXU-LI Port | Width          | AXI-S Port               |
|-----|-------------|----------------|--------------------------|
| in  | clk         |                | aclk                     |
| in  | rst         |                | aresetn (inverted)       |
| in  | clk_en      |                | -                        |
| in  | req_valid   |                | reqs_tvalid              |
| out | req_ready   |                | reqs_tready              |
| in  | req_id      | CXU_REQ_ID_W   | reqs_tid or reqs_tdest   |
| in  | req_cxu     | CXU_CXU_ID_W   | reqs_tuser or reqs_tdest |
| in  | req_state   | CXU_STATE_ID_W | reqs_tuser               |
| in  | req_func    | CXU_FUNC_ID_W  | reqs_tuser               |
| in  | req_insn    | CXU_INSN_W     | reqs_tuser               |
| in  | req_data0   | CXU_DATA_W     | reqs_tdata               |
| in  | req_data1   | CXU_DATA_W     | reqs_tdata               |
| in  | -           |                | reqs_tlast optional      |
| in  | -           | *              | reqs_tstrb optional      |
| in  | -           | *              | reqs_tkeep optional      |
| out | resp_valid  |                | resps_tvalid             |
| in  | resp_ready  |                | resps_tready             |
| out | resp_id     | CXU_REQ_ID_W   | resps_tid or resps_tdest |
| out | resp_status | CXU_STATUS_W   | resps_tuser              |
| out | resp_data   | CXU_DATA_W     | resps_tdata              |
| out | -           |                | resps_tlast optional     |
| out | -           | *              | resps_tstrb optional     |
| out | -           | *              | resps_tkeep optional     |

When several CXU-LI signals map to a single AXI-S port, the signals are to be concatenated in order, each signal assigned successively more significant bits. For example, using Verilog concatenation:

```
reqs_tuser = { req_insn,req_func,req_state,req_cxu };
reqs_tdata = { req_data1,req_data0 };
```
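
The same packing rule can be cross-checked in software, for instance in a testbench model. The Python sketch below is illustrative: the function name is hypothetical and the field widths in the example (2-bit CXU\_ID, 3-bit STATE\_ID, 10-bit CF\_ID, no req\_insn, 32-bit data) are arbitrary assumptions, not values required by CXU-LI.

```python
def concat_lsb_first(fields):
    """Pack (value, width) pairs, first field in the least significant bits.

    Returns (packed_word, total_width). Mirrors a Verilog concatenation
    written in the reverse order, e.g. {last, ..., first}.
    """
    word, shift = 0, 0
    for value, width in fields:
        if value < 0 or value >> width:
            raise ValueError("value does not fit in its field")
        word |= value << shift
        shift += width
    return word, shift


# reqs_tuser = { req_insn, req_func, req_state, req_cxu } with assumed widths
tuser = concat_lsb_first([(1, 2), (2, 3), (5, 10), (0, 0)])
# reqs_tdata = { req_data1, req_data0 }
tdata = concat_lsb_first([(0x11111111, 32), (0x22222222, 32)])
```

A zero-width field, such as req\_insn when CXU\_INSN\_W=0, contributes no bits to the packed word.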

Use reqs\_tdest when req\_id and/or req\_cxu indicate/encode a specific AXI-S destination (of a bridge to a CXU). Use resps\_tdest when resp\_id indicates a specific AXI-S destination (of a bridge to a requester, e.g., CPU).

# 4. CXU Metadata (CXU-MD)

To help automate system composition, each composable hardware core (each CPU and CXU) shall include a metadata file which defines the properties, features, and supported values of its configuration parameters.

For each core, for each configuration parameter, metadata may specify a subset of the set of legal configuration parameter values defined in §3.4.1.

Metadata configuration parameter values are encoded as either a single value, a list of values, or a range of values. For a continuous range of integer values, the parameter value is range, and the inclusive range of values is found in a corresponding parameter whose name ends in \_range. For example, func\_id\_w: range together with func\_id\_w\_range: [5,10] allows the values 5 through 10, as in Listing 3.

## 4.1. CXU Metadata

Listing 2 specifies the CXU metadata format, in YAML. Each legal configuration parameter range of §3.4.1 CXU\_PARAM may be overridden (subsetted) through a YAML parameter line param:

The CXU metadata may also be used to specify other custom (non-standard / CXU specific) configuration parameter settings.

Listing 2. CXU metadata format

```
cxu_name: string
cxu_li:
   feature_level: scalar                   # required. allowed: 0-3
   state_id_max: scalar | list | 'range'   # level:any. default: any. 0 => stateless
   req_id_w: scalar | list | 'range'       # level:2+. default: 0
   cxu_id_w: scalar | list | 'range'       # level:any. default: 0
   state_id_w: scalar | list | 'range'     # level:1+. default: 0
   insn_w: scalar | list | 'range'         # level:1+. default: 0
   func_id_w: scalar | list | 'range'      # level:any. default: 10
   data_w: scalar | list                   # level:any. default: 32
   latency: scalar | list | 'range'        # level:1. default: 1
   reset_latency: scalar | list | 'range'  # level:1. default: 0
   xyz_range: [min,max]                    # when parameter xyz is range
```
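The three value encodings in Listing 2 (scalar, list, and range with a companion \_range parameter) can be normalized to an explicit list of permitted values. The following is a minimal sketch, assuming the YAML has already been loaded into a Python dict; the function name allowed\_values is hypothetical, and treating an omitted or empty parameter as "no override" is an assumption of this sketch rather than a rule stated by this specification.

```python
def allowed_values(section, name):
    """Return the values permitted for parameter `name`.

    `section` is the cxu_li mapping as a dict. Returns None when the
    parameter is omitted or empty (assumed here to mean no override).
    """
    value = section.get(name)
    if value is None:
        return None
    if value == "range":
        # Inclusive bounds live in the companion <name>_range parameter.
        lo, hi = section[name + "_range"]
        return list(range(lo, hi + 1))
    if isinstance(value, list):
        return list(value)
    return [value]  # single scalar value


# Example cxu_li mapping using all three encodings.
cxu_li = {
    "feature_level": 1,
    "req_id_w": None,
    "func_id_w": "range",
    "func_id_w_range": [5, 10],
    "data_w": 64,
    "latency": [2, 3, 4],
}

for param in ("func_id_w", "data_w", "latency", "req_id_w"):
    print(param, allowed_values(cxu_li, param))
```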



Need some stronger naming of CXUs and CPUs here. Perhaps a GUID, perhaps a URL.



Do we need to specify here which CX\_IDs the CXU implements?

## 4.2. Example CXU metadata

Listing 3 is example CXU metadata for a CXU-L1 CXU which supports only one state context, requires at least 5-bit CF\_IDs, requires XLEN=64, and supports a response latency of 2-4 cycles.

Listing 3. Example CXU metadata (CXU-L1)

```
cxu_name: bobs_bnn_cxu
cxu_li:
   feature_level: 1
   state_id_max: 1           # only supports 1 state context
   req_id_w:                 # any req_id is fine
   cxu_id_w: 0               # no req_cxu
   state_id_w: 0             # no req_state_id
   insn_w: 0                 # no req_insn
   func_id_w: range          # need >= 5-bit CF_IDs
   func_id_w_range: [5,10]   # so [5,6,7,8,9,10] are OK
   data_w: 64                # XLEN=64-bit only
   latency: [2,3,4]          # configurable w/ 2-4 cycles of latency
   other:
      adder_tree: [0,1]      # non-standard config parameter
      element_w: [4,8,16,32] # non-standard config parameter
```
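To illustrate how such metadata might be consumed, the following sketch checks a candidate set of concrete parameter settings against the standard parameters of Listing 3. It is an assumption-laden example, not a definition of composer behavior: the names LISTING\_3, permits, and violations are hypothetical, and an empty parameter is treated as accepting any value.

```python
# Standard cxu_li parameters of Listing 3, as a Python dict.
LISTING_3 = {
    "feature_level": 1,
    "state_id_max": 1,
    "req_id_w": None,            # empty in the listing: any req_id is fine
    "cxu_id_w": 0,
    "state_id_w": 0,
    "insn_w": 0,
    "func_id_w": "range",
    "func_id_w_range": [5, 10],
    "data_w": 64,
    "latency": [2, 3, 4],
}


def permits(section, name, value):
    """True if `value` is an allowed setting of parameter `name`."""
    spec = section.get(name)
    if spec is None:
        return True              # assumed: no constraint stated
    if spec == "range":
        lo, hi = section[name + "_range"]
        return lo <= value <= hi  # inclusive range
    if isinstance(spec, list):
        return value in spec
    return value == spec


def violations(section, config):
    """Names of the parameters in `config` that the metadata rejects."""
    return [n for n, v in config.items() if not permits(section, n, v)]


ok_config = {"func_id_w": 6, "data_w": 64, "latency": 3}
bad_config = {"func_id_w": 4, "data_w": 32, "latency": 3}
print(violations(LISTING_3, ok_config))
print(violations(LISTING_3, bad_config))
```

Here the second configuration is rejected on two counts: its 4-bit CF\_IDs fall below the [5,10] range, and its 32-bit data width is not the single permitted value 64.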

## 4.3. CPU Metadata

As described in §3.10, CPUs, as CXU requesters, use specific CXU-LI feature levels. As with CXUs, CPUs use CXU metadata to override configuration parameter defaults, in this case to define what the CPU requires or accepts of its CXU (which is, generally, the root of the DAG of CXUs).

Listing 4. CPU metadata format

```
cpu_name: string
cxu_li: # see [Listing 2].
```

## 4.4. Example CPU metadata

Listing 5 is example CXU metadata for a CPU that requires and supports only 32-bit combinational CXUs.

Listing 5. Example CPU metadata (requires a CXU-L0 CXU DAG)

```
cpu_name: carols_simple_scalar_cpu
cxu_li:
    feature_level: 0  # L0 combinational CXUs only
    state_id_max:  # L0: n/a
    req_id_w:  # L0: n/a
    cxu_id_w:  # supports arbitrary CXU_IDs
    state_id_w:  # L0: n/a
    insn_w:  # L0: n/a
    func_id_w:  # supports arbitrary CF_IDs
    data_w: 32  # XLEN=32-bit only
```
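One plausible use of CPU and CXU metadata together is to intersect, parameter by parameter, the values each side accepts. The sketch below does this for the CPU of Listing 5 and the CXU of Listing 3 (both abridged). It is only an illustration under stated assumptions: this section does not specify a matching algorithm, the names allowed and common\_values are hypothetical, and an empty parameter is assumed to accept any value.

```python
def allowed(section, name):
    """Set of permitted values for `name`, or None if unconstrained."""
    spec = section.get(name)
    if spec is None:
        return None
    if spec == "range":
        lo, hi = section[name + "_range"]
        return set(range(lo, hi + 1))  # inclusive range
    if isinstance(spec, list):
        return set(spec)
    return {spec}


def common_values(cpu_li, cxu_li, name):
    """Values of `name` acceptable to both sides (None = unconstrained)."""
    a, b = allowed(cpu_li, name), allowed(cxu_li, name)
    if a is None:
        return b
    if b is None:
        return a
    return a & b


# Listing 5 (CPU), abridged: empty parameters become None.
cpu_li = {"feature_level": 0, "cxu_id_w": None, "func_id_w": None, "data_w": 32}

# Listing 3 (CXU), abridged.
cxu_li = {
    "feature_level": 1,
    "cxu_id_w": 0,
    "func_id_w": "range",
    "func_id_w_range": [5, 10],
    "data_w": 64,
}

for param in ("cxu_id_w", "func_id_w", "data_w"):
    print(param, common_values(cpu_li, cxu_li, param))
```

An empty intersection, as for data\_w here (32 versus 64), indicates that this CPU and this CXU cannot be configured to agree on that parameter.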

## 4.5. System manifest



TODO



Consider CX library metadata too. "I may use this subset { CF\_IDs } of the CF\_IDs of extension CX\_ID."

# 5. TODO

#### Todo:

- Chapter on CX Runtime (runtime) API
- How CX and CXU versioning works; how CXU-LI versioning works
- A place for miscellaneous design notes

## 5.1. Open design problems (post 1.0)

- Developing and running accelerated libraries on systems where there is no composable extension / CXU implementation.
- Developer tooling recommendations for disassembly, debugging, profiling, perf monitoring.

## 5.2. Cost model



Here write up a brief estimate of the FPGA area overhead of various -Zicx and CXU-LI mechanisms and behaviors.

# 6. Specification Change History

## 6.1. Version 0.92.231111, 2023-11-11: Add extension multiplexing type.

Introduce *CX mode*, improving CX forward compatibility. Replace mcx\_selector.en with mcx\_selector.type\_. Add cx\_status.tp error field.

## 6.2. Version 0.91.230803, 2023-08-03: Simplify and improve terminology.

Replace term *Custom Interface (CI)* with *Composable Extension (CX)*. Similarly replace *CFU* with *CXU*. And so forth.

| From                       | To                              |
|----------------------------|---------------------------------|
| Custom Interface (CI)      | Composable Extension (CX)       |
| Custom Function Unit (CFU) | Composable Extension Unit (CXU) |
| -Zicfu                     | -Zicx                           |
| mcfu_* and cfu_* CSRs      | mcx_* and cx_* CSRs             |

## 6.3. Version 0.90.220327, 2022-03-27: First complete draft.

# References

Microsoft. (2020). *Component Object Model: Interfaces and Interface Implementations*. docs.microsoft.com/en-us/windows/win32/com/interfaces-and-interface-implementations

Waterman, A., & Asanović, K. (2019). *RISC-V Instruction Set Manual, Volume I: Unprivileged ISA, v. 20191213.* github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf

Waterman, A., Asanović, K., & Hauser, J. (2021). *RISC-V Instruction Set Manual, Volume II: Privileged ISA, v. 20211203.* github.com/riscv/riscv-isa-manual/releases/download/Priv-v1.12/riscv-privileged-20211203.pdf