

### **Definition**

The back end is the processing phase that maps the IR to a target representation that is ready to perform the intended computation.

It extends the meaning carried by the IR by taking the target architecture into account.

An intermediate representation uses the results of the preceding phases (lexical analysis, syntax analysis, semantic analysis) and performs further processing to make target generation easier. While the structures generated by the semantic analysis phase describe what is in the input, the IR generation phase takes the target architecture into account so that target generation is possible.



## **Architecture**

- Programmers see the computer through the window provided by the designers at the level of its functional architecture.
- This window is provided by you, the designer. (1)
- The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation. (2)

1. From Prof. Dr. Bozşahin's CENG444 course lecture.
2. Amdahl, Gene & Blaauw, Gerrit & Brooks, Jr, Frederick. (2000). Architecture of the IBM System/360. IBM Journal of Research and Development. 44. 21-36. 10.1147/rd.82.0087.



In most cases, a compiler's ultimate output is executable code, that is, finite sequences of instructions that will be fed to the CPU. Each CPU exposes its programmability through its ISA (Instruction Set Architecture), which defines the instruction groups (such as arithmetic and logic, status and flow control, data movement, and so on), addressing modes in conjunction with memory and I/O management mechanisms, the register file, modes of operation (word length, process space size, privilege levels, and so on), properties critical to concurrency control, and more. The ISA is the major determinant of the translation to be performed by the compiler.



An instruction set architecture can also be defined at the software level for the purposes of emulation, interpretation, or similar. The target generated for a software-defined ISA can be translated further to a processor-level ISA. The just-in-time (JIT) compiler that is part of the Java execution model is a good example of this two-level scheme: Java source is translated to bytecode by the Java compiler, then the JIT compiler translates the bytecode to native processor code to enable execution. For related examples, see Oak as a historical landmark, Dalvik as a relatively modern approach on Android, and ART as the current target scheme on Android.





# Interoperability over ABI

The Application Binary Interface

- Use of Register File
- Shape of Activation Records
- Implementation of Specific Calling Conventions
- Memory Organization (Stack, ...)
- Roles specific to the Registers
- Registers to Scratch
- Registers to Preserve

An application binary interface is a set of standards and recommendations that governs the data flow between code units. When respected by the back ends, a well-established ABI enables integration of code units regardless of their source languages. Generally, an ABI defines the conventions used in parameter passing, value returning, and register utilization. Stack and memory organization may also be addressed to a certain extent. An ABI defines the conventions specific to a certain architecture. Ideally, an ABI covers every possible calling convention and the related formal structures of the activation records.



Even if instruction set architectures are the dominant determinant in the design of back ends, the software layers that underpin program execution must also be considered part of the architecture to which the compiler must conform. The application binary interface (ABI) requirements that define the parameter passing conventions can differ between operating systems even when they run on the same hardware. The 64-bit versions of Windows and Linux use different ABIs, so code generators must be developed with the differences in mind even if they target the same Intel-based PCs. For example, integer parameters are passed using a 4-register fast call on Windows (RCX, RDX, R8, and R9) and a 6-register fast call on Linux (RDI, RSI, RDX, RCX, R8, and R9). There are more differences in conventions that have to be understood and applied in machine code synthesis. On top of these, it is quite possible to develop a code generator that uses a totally different, custom ABI to run code in an isolated fashion for application-specific reasons.

For more, see <a href="https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170">https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170</a> for Windows and <a href="https://www.ired.team/miscellaneous-reversing-forensics/windows-kernel-internals/linux-x64-calling-convention-stack-frame">https://www.ired.team/miscellaneous-reversing-forensics/windows-kernel-internals/linux-x64-calling-convention-stack-frame</a> for Linux.
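The register lists above can be turned into a small lookup that a code generator might consult when placing arguments. This is only a sketch under simplifying assumptions: it models integer arguments alone, the ABI keys (`windows-x64`, `linux-x64`) and the function name `assign_int_args` are hypothetical, and stack slots are simply numbered in argument order rather than laid out as either ABI actually requires.

```python
# Integer-argument registers, in order, as listed in the text above.
INT_ARG_REGS = {
    "windows-x64": ["RCX", "RDX", "R8", "R9"],
    "linux-x64": ["RDI", "RSI", "RDX", "RCX", "R8", "R9"],
}


def assign_int_args(abi, arg_names):
    """Map integer arguments to registers, overflowing to the stack.

    Assumption: only integer arguments are modeled; real ABIs also
    define rules for floating-point values, structures, and alignment.
    """
    regs = INT_ARG_REGS[abi]
    placement = {}
    for index, name in enumerate(arg_names):
        if index < len(regs):
            placement[name] = regs[index]
        else:
            # Hypothetical notation for "the n-th stack-passed argument".
            placement[name] = f"stack[{index - len(regs)}]"
    return placement
```

Calling `assign_int_args` with the same five argument names for both keys shows the point of the paragraph: the fifth argument lands on the stack under the Windows convention but in R8 under the Linux convention, so the same call site needs different code.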



However, from the perspective of a compiler, the lower-level architecture involves more than the CPU. There may be cases where hardware elements external to the CPU must be considered. GPU code generation, a contemporary hot topic, is one example. NVIDIA has a custom C++ compiler (CUDA C, nvcc) to generate and run GPU kernels. Industry-leading companies continuously invest in research and development aiming at better GPU code generation.

See <a href="https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/">https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/</a> for a rough idea of a customized language processor.

For a very informative resource, see the presentation from AMD: Nicolai Hähnle, "Code Generation for AMD GPUs", 2023, AMD. Source: <a href="https://db.in.tum.de/teaching/ws2223/codegen/CodegenForGPUs.pdf?lang=e">https://db.in.tum.de/teaching/ws2223/codegen/CodegenForGPUs.pdf?lang=e</a>



As a final remark on the architecture, we must consider the architectural properties that may span multiple layers of a system. Hardware and lower-level software components may become so variable and application-specific that developers are forced to update multiple layers of their language processors. The impact of the alternatives may be so deep that changes in the language definition and in the semantics of the whole translation become inevitable, most probably in an enriching fashion.



### **Code Generation**

Mapping IC to Machine Code

#### Constraints, conditions, problems

- ISA (Load Store, Register Memory)
- ABI
- Code and Data Organization
  - Generated code / data
  - RTTI
  - Object / target code standards
  - Intrinsic code / objects
  - More ...
- Instruction Selection
- Instruction Scheduling
- Register Allocation

In a way, the target architecture is a generalization that defines the limits of the computations that can be performed. In this sense, it defines the set of target elements that can be composed to meet the purpose of the source code.

The constraints come from instruction set architectures (register-memory, register-register, etc.), compiler runtimes (e.g., the new operator in C++), software abstractions such as VMs, and memory models defining a range of abstractions from allocation strategies to variables, structures, words, etc.

The core tasks are instruction selection, instruction scheduling, and register allocation. Ideally, the IR (and hence the IC) is isolated from the target architectures. However, it may be more feasible to derive hints for the code generators at the IR processing phase.





#### Instruction Selection

The complexity of instruction selection derives from the large number of alternative implementations that a typical ISA provides for even simple operations (\*).

| `6*i + 5` | `4*i + 5` |
| --- | --- |
| `...` | `...` |
| `mov rax, [rbp + 16]` | `mov rcx, [rbp + 16]` |
| `imul rax, 6` | `lea rax, [5 + 4*rcx]` |
| `add rax, 5` | `...` |
| `...` | |

Techniques may rely on pattern matching on both graphical and linear IR.

(\*) Excerpt from "Cooper, K.D., Torczon, L.; Engineering A Compiler, 2nd Edition"

The instruction selection problem arises from the abundance of alternatives. The code generator may be forced to choose between the shorter and the faster sequence. This choice is highly dependent on the ISA. It is a high-complexity job because the execution and memory costs must be analyzed separately for each possible combination.
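The choice between alternatives can be sketched as enumerating candidate sequences and picking the cheapest one. The sketch below covers only the `scale*i + offset` pattern from the example above; the names `candidates` and `select` are hypothetical, the default cost (instruction count) is an assumption standing in for a real cost model, and the offset is assumed to fit in an address displacement.

```python
# Scale factors that an x86-64 addressing mode can encode.
LEA_SCALES = (1, 2, 4, 8)


def candidates(scale, offset, slot="[rbp + 16]"):
    """Yield each instruction sequence this sketch knows for scale*i + offset."""
    # General pattern: load, multiply, then add.
    yield [f"mov rax, {slot}", f"imul rax, {scale}", f"add rax, {offset}"]
    # Address-arithmetic pattern: only legal for encodable scale factors.
    if scale in LEA_SCALES:
        yield [f"mov rcx, {slot}", f"lea rax, [{offset} + {scale}*rcx]"]


def select(scale, offset, cost=len):
    """Pick the cheapest candidate; cost defaults to instruction count."""
    return min(candidates(scale, offset), key=cost)
```

Passing a different `cost` function (for example one based on latencies or encoded size) changes which candidate wins, which is the shorter-versus-faster trade-off described above.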




#### Instruction Scheduling

Instruction scheduling attempts to reorder the operations in a procedure to improve its running time. In essence, it tries to execute as many operations per cycle as possible. (\*)

- Superscalar architectures
- Instruction level parallelism
- Forcing architectures (Itanium)
- Dependency analysis and out of order execution
- Requires detailed analysis of processor specific parallelism

(\*) Excerpt from "Cooper, K.D., Torczon, L.; Engineering A Compiler, 2nd Edition"

Reordering tries to meet optimization targets by moving code around without affecting the correctness of the computation. Some architectures require explicit instruction bundles, which makes the scheduling problem more visible.
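A greedy list scheduler is one common way to approach this reordering. The sketch below is a minimal version under stated assumptions: the operation names, latencies, and the issue width of 2 are made up for illustration, every latency is at least 1, and the dependencies are acyclic and refer only to operations in the list.

```python
def list_schedule(ops, deps, latency, width=2):
    """Greedy list scheduling.

    ops:     operation names in original program order
    deps:    dict op -> set of ops whose results it needs
    latency: dict op -> cycles until the result is available (>= 1)
    width:   operations that may issue per cycle (assumed value)
    Returns a list of cycles, each a list of the operations issued in it.
    """
    done_at = {}            # op -> first cycle in which its result is usable
    remaining = list(ops)
    schedule = []
    cycle = 0
    while remaining:
        issued = []
        for op in remaining:
            ready = all(
                d in done_at and done_at[d] <= cycle
                for d in deps.get(op, ())
            )
            if ready and len(issued) < width:
                issued.append(op)
        for op in issued:
            done_at[op] = cycle + latency[op]
            remaining.remove(op)
        schedule.append(issued)   # may be empty: a stall cycle
        cycle += 1
    return schedule


# Hypothetical example: (a + b) * c with slow loads.
ops = ["load_a", "load_b", "add", "load_c", "mul"]
deps = {"add": {"load_a", "load_b"}, "mul": {"add", "load_c"}}
latency = {"load_a": 3, "load_b": 3, "load_c": 3, "add": 1, "mul": 2}
schedule = list_schedule(ops, deps, latency)
```

In this example `load_c` is issued before `add` even though it comes later in program order, because it does not depend on anything still in flight. Correctness is preserved because an operation is only issued once all of its inputs are available.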






The control flow graph (CFG) may be built on top of the linear IR, or a completely new graph representation may be preferred. In the former case, an edge container refers to the linear IR nodes. The latter method constructs the graph from the linear IR and uses the derived representation in its own right. The CFG is helpful for identifying dead code and moveable code, for register allocation, and more.



The edges of the control flow graph are labeled to identify the lifetimes of the virtual variables.

The nodes of the graph are visited in a loop, and the following two steps are repeated as long as a label changes anywhere in the graph:

- (1) If a node requires a variable to complete an operation, that variable must appear on all of the incoming edges.
- (2) Each of the variables on the outgoing edges must be copied into the label of the incoming edges except the one that is assigned on that node.

Note that the operations on the nodes of the example graph are compatible with three address notation.
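The two labeling steps above can be sketched as a fixed-point iteration. This is a minimal sketch, assuming each node holds one three-address operation that assigns at most one variable; the node names and the tiny straight-line program are hypothetical and are not the example graph from the lecture.

```python
def liveness(nodes, succ):
    """Label incoming edges with live variables until nothing changes.

    nodes: dict name -> (assigned_var_or_None, set_of_used_vars)
    succ:  dict name -> list of successor node names
    Returns dict name -> set of variables on the node's incoming edges.
    """
    live_in = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n, (assigned, used) in nodes.items():
            # Variables on the outgoing edges: labels of all successors.
            live_out = set()
            for s in succ.get(n, []):
                live_out |= live_in[s]
            # Step (1): used variables must be on the incoming edges.
            # Step (2): copy outgoing labels, except the assigned variable.
            new_in = set(used) | (live_out - {assigned})
            if new_in != live_in[n]:
                live_in[n] = new_in
                changed = True
    return live_in


# Hypothetical program: a = 1; b = a + 2; c = a * b; return c
nodes = {
    "n1": ("a", set()),
    "n2": ("b", {"a"}),
    "n3": ("c", {"a", "b"}),
    "n4": (None, {"c"}),
}
succ = {"n1": ["n2"], "n2": ["n3"], "n3": ["n4"]}
labels = liveness(nodes, succ)
```

The loop terminates because labels only grow and the set of variables is finite. Variables that appear together in a label are live at the same time and therefore cannot share a register.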



There are different coloring strategies based on heuristics. In 1982, Chaitin proposed a method that views the register allocation problem as a graph coloring problem.
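The graph coloring view can be illustrated with a simplify-and-select sketch in the spirit of Chaitin's approach. It is a simplified illustration, not the published algorithm: the spill heuristic (highest remaining degree) is an assumption chosen for brevity, the interference graph is assumed to be symmetric, and the variable names are hypothetical.

```python
def color_graph(interference, k):
    """Color an interference graph with k registers.

    interference: dict var -> set of vars live at the same time
                  (assumed symmetric)
    k:            number of available registers (colors)
    Returns (coloring, spilled): var -> register index, and the list of
    variables chosen for spilling.
    """
    graph = {v: set(ns) for v, ns in interference.items()}
    stack, spilled = [], []
    while graph:
        # Simplify: a node with fewer than k neighbors can always be colored.
        easy = [v for v in graph if len(graph[v]) < k]
        if easy:
            node = easy[0]
            stack.append(node)
        else:
            # Assumed heuristic: spill the node with the most neighbors.
            node = max(graph, key=lambda v: len(graph[v]))
            spilled.append(node)
        for n in graph.pop(node):
            graph[n].discard(node)
    coloring = {}
    while stack:
        # Select: give each node the lowest color its neighbors do not use.
        node = stack.pop()
        used = {coloring[n] for n in interference[node] if n in coloring}
        coloring[node] = min(c for c in range(k) if c not in used)
    return coloring, spilled


# Hypothetical graph: a, b, c interfere pairwise; d interferes with a.
example = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

With three registers the example graph colors without spilling. With two, the triangle formed by `a`, `b`, and `c` cannot be colored, so one variable must be spilled.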



When the graph is colored after spilling, the whole liveness analysis and the subsequent steps are repeated. Now, access to f is memory based! The stack is the usual destination for storing spilled variables. These are compiler-generated temporaries materialized in memory. Register pressure is a term for the mismatch between the number of live variables and the number of available registers at some execution point.



There is more than one spilling option!


