# About x86-64 by example

*Let the compiler teach you 64-bit x86 assembler, one example at a time.*

## Getting started

Whenever I've helped someone get started with assembler, whether in person or via a forum or message board, I've always given the same advice regarding [Matt Godbolt's Compiler Explorer](https://godbolt.org): learn from the compiler.

I stand by this advice 100%. However, I also suspect that following it is not easy. It can take a lot of work to figure out how to get the compiler to help you! On that basis I decided to throw together an introductory x86-64 course (x86-64 being something I needed to learn). The course consists almost entirely of assembler examples generated by gcc.

What does that actually look like? Well, the course is full of little boxes full of C code (like the one below). Each one is followed by the resulting assembly output.

In [1]:
%%python -m gcc
int square(int x)
{
    return x * x;
}

square:
	imul	edi, edi
	mov	eax, edi
	ret


Examples like the above are compiled at `-O2` by default, although some include alternative parameters in the first line.

Whilst we might, one day, outgrow the teachings of the compiler, it can give us a **lot** of help along the way. For example, I have a strong interest in machine level debugging, but most of my expertise and experience is on RISC architectures. However I'm confident the compiler can teach me (almost) everything I need to know about assembly level debugging simply because (almost) everything I need to debug will have been run through that same compiler.

The gaps the compiler leaves are usually in the fine details of writing assembler. Good assembly authors know how to use the assembler's macro system to minimize repetition. Sadly it's impossible to learn this from the compiler because it never generates macros. So one day you might need to go beyond the scope of the C compiler but, for now, whether your interest is debugging or coding, let's buckle in and see just how much assembler the compiler can teach us!

## Running the notebook yourself

For "cosmetic reasons" the code that renders assembler from C appears in Appendix A. That means the first attempt to **Run All** will fail because `gcc.py` does not exist until Appendix A has been run. Scroll down to Appendix A and run the cell there. After that simply **Restart and Run All** and every cell will run correctly.

# Introduction

## AT&T versus Intel syntax

Before we start it's probably useful to mention that, unlike almost every other architecture that remains relevant in the 21st century, x86 supports multiple assembly syntaxes.

Most x86 software that originates from the Unix tradition uses AT&T syntax (which is hardly a surprise since that's where Unix was born). That means that all x86 assembly in the Linux kernel sources uses AT&T syntax and it is the default for GNU tools such as gcc and gdb. AT&T uses:

* `%` as a sigil for register names
* Destination-last operand ordering (e.g. `addl %esi, %edi` ⇒ `edi += esi`)
* Addressing modes marked with (parentheses)
* Operand widths are explicit (`addl` operates on 32-bit "long" values)

gcc will use AT&T by default (including for inline assembly) and it can be explicitly requested with `-masm=att`:

In [2]:
%%python -m gcc -masm=att
int add_sub(int a, int b, int c, int d)
{
    return a + b + c - d;
}

add_sub:
	addl	%esi, %edi
	leal	(%rdi,%rdx), %eax
	subl	%ecx, %eax
	ret


Most code originating from the DOS/Windows tradition uses Intel syntax, as does the Intel documentation. Intel syntax uses:

* No sigils
* Destination-first operand ordering
* Addressing modes marked with [square brackets]
* Operand widths are inferred from the register names (`add` operates on 32-bit values because the operand is `edi`, which is the 32-bit view of the `rdi`/`edi`/`di`/`dil` register)

gcc will use Intel syntax if requested with `-masm=intel`:

In [3]:
%%python -m gcc -masm=intel
int add_sub(int a, int b, int c, int d)
{
    return a + b + c - d;
}

add_sub:
	add	edi, esi
	lea	eax, [rdi+rdx]
	sub	eax, ecx
	ret


Arguably Intel syntax, with its destination-on-the-left design and its use of [square brackets] for addressing mode operands, is **much** closer to the idioms used in modern RISC assembly languages (including both Arm and RISC-V).

In order to help those with a strong RISC background, this document uses Intel assembly despite the author's strong affiliation to the Unix traditions!

If you prefer to work with AT&T assembly, then go to Appendix A, change `SYNTAX` to `-masm=att` and regenerate the document (twice).
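The differences between the two syntaxes listed above can be sketched as a tiny translator. This is a toy model only: the function name `att_to_intel` and the `KNOWN_MNEMONICS` table are made up for illustration, and it handles nothing beyond register-to-register instructions like the `addl` and `subl` lines in the listings above.

```python
# Hypothetical helper for illustration; not part of gcc.py.
# Only the mnemonics listed here have their AT&T width suffix removed,
# so an unsuffixed "sub" is not mistaken for "su" plus a "b" suffix.
KNOWN_MNEMONICS = {"add", "sub", "mov", "xor", "and", "or"}
WIDTH_SUFFIXES = "bwlq"


def att_to_intel(line):
    """Translate a register-only AT&T instruction into Intel syntax."""
    mnemonic, operand_text = line.split(None, 1)
    operands = [op.strip() for op in operand_text.split(",")]
    if not all(op.startswith("%") for op in operands):
        raise ValueError("only register operands are supported")

    # Drop the '%' sigil and reverse into destination-first order.
    registers = [op[1:] for op in reversed(operands)]

    # Drop the explicit width suffix; Intel infers it from the registers.
    if mnemonic[-1] in WIDTH_SUFFIXES and mnemonic[:-1] in KNOWN_MNEMONICS:
        mnemonic = mnemonic[:-1]

    return mnemonic + " " + ", ".join(registers)


print(att_to_intel("addl %esi, %edi"))
print(att_to_intel("subl %ecx, %eax"))
```

Memory operands such as `(%rdi,%rdx)` are deliberately rejected; converting them needs a real parser rather than string slicing.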

## x86-64 registers

For now, let's ignore the floating point and vector registers and focus only on the general-purpose integer registers. These are the ones that the base instruction set will focus on. 

```
+------------+------+----+----+    +------------+------+----+----+
|    rax     | eax  | ax | al |    |     r8     |  r8d | r8w| r8b|
|    rbx     | ebx  | bx | bl |    |     r9     |  r9d | r9w| r9b|
|    rcx     | ecx  | cx | cl |    |    r10     | r10d |r10w|r10b|
|    rdx     | edx  | dx | dl |    |    r11     | r11d |r11w|r11b|
|    rsi     | esi  | si |sil |    |    r12     | r12d |r12w|r12b|
|    rdi     | edi  | di |dil |    |    r13     | r13d |r13w|r13b|
|    rbp     | ebp  | bp |bpl |    |    r14     | r14d |r14w|r14b|
|    rsp     | esp  | sp |spl |    |    r15     | r15d |r15w|r15b|
+------------+------+----+----+    +------------+------+----+----+
```
*16 General Purpose registers (including the stack pointer)*

Each of the functions below takes a single argument and returns it as a 64-bit value. Together they demonstrate moving each of the different views of the same register (`rdi`/`edi`/`di`/`dil`) to `rax`.

In [4]:
%%python -m gcc
#include <stdint.h>
int64_t f_int64(int64_t x) { return x; }
int64_t f_int32(int32_t x) { return x; }
int64_t f_int16(int16_t x) { return x; }
int64_t f_int8(int8_t x) { return x; }

f_int64:
	mov	rax, rdi
	ret
f_int32:
	movsxd	rax, edi
	ret
f_int16:
	movsx	rax, di
	ret
f_int8:
	movsx	rax, dil
	ret
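To make the register views concrete, here is a small Python model of what the listing above does. It is a sketch, not a simulator: `register_views` and `sign_extend` are names invented for this example, and registers are modelled as plain unsigned 64-bit integers.

```python
MASK64 = (1 << 64) - 1


def register_views(rdi):
    """Return the four views of one 64-bit register value."""
    return {
        "rdi": rdi & MASK64,       # all 64 bits
        "edi": rdi & 0xFFFFFFFF,   # low 32 bits
        "di": rdi & 0xFFFF,        # low 16 bits
        "dil": rdi & 0xFF,         # low 8 bits
    }


def sign_extend(value, bits):
    """Model movsx/movsxd: widen a `bits`-wide value to 64 bits."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # If the sign bit is set the value is negative; masking the negative
    # Python integer gives its 64-bit two's complement representation.
    signed = value - (1 << bits) if value & sign_bit else value
    return signed & MASK64


views = register_views(0x1122334455667788)
print(hex(views["dil"]))
print(hex(sign_extend(views["dil"], 8)))
```

The narrow views simply mask off the upper bits, whereas the `movsx` family copies the top bit of the narrow view into every higher bit of the destination.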


## Procedure call ABI

TODO

# Improbable idioms

At this stage in the course we need to introduce a couple of quirky compiler features.
If this was a "normal" assembler course then this material would appear near the end, or maybe even not appear at all.
However we need to cover this material now because we are planning to let the compiler teach us things.
That means we need to recognize these improbable idioms because the compiler is going to keep showing them to us!



## Generating constant zero

Let's keep things simple and look at the code generated to load the constant zero:

In [5]:
%%python -m gcc
int zero(void) {
    return 0;
}

zero:
	xor	eax, eax
	ret


Weird, it's generating an `xor`! Most arithmetic operations on x86 take two operands, meaning one of them is both a source operand and the destination. In this case the `xor` is, effectively, `eax ^= eax`. Regardless of its initial value, `eax` will be set to zero (and because writing a 32-bit register zero-extends the result into the full 64-bit register, `rax` also becomes zero).

Why?

Take a look at the actual machine code. The `xor` instruction is just two bytes.

In [6]:
%%python -m gcc objdump
int zero (void) {
    return 0;
}

0000000000000000 <zero>:
   0:	31 c0                	xor    eax,eax
   2:	c3                   	ret


If we compare it to loading the constant one, we see a simple `mov` instruction, but that instruction is five bytes long. So the compiler prefers `xor` to reduce the code size!

In [7]:
%%python -m gcc objdump
int one (void) {
    return 1;
}

0000000000000000 <one>:
   0:	b8 01 00 00 00       	mov    eax,0x1
   5:	c3                   	ret
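The two observations in this section (any value xor'd with itself is zero, and the `xor` encoding is shorter than the `mov` encoding) can be checked with a short Python sketch. The helper names are invented for this example; the byte values are the ones shown in the objdump listings above, with the 32-bit immediate stored in little-endian order.

```python
import struct

MASK32 = 0xFFFFFFFF

# Machine code for `xor eax, eax`, as shown in the disassembly.
XOR_EAX_EAX = bytes([0x31, 0xC0])


def xor_eax_eax(rax):
    """Model `xor eax, eax` and return the resulting value of rax."""
    eax = rax & MASK32
    eax ^= eax
    # A 32-bit register write zero-extends, so the upper half of rax is
    # cleared as well and the full register equals the 32-bit result.
    return eax


def mov_eax_imm32(value):
    """Encode `mov eax, imm32`: opcode 0xb8 plus a little-endian imm32."""
    return bytes([0xB8]) + struct.pack("<I", value & MASK32)


print(xor_eax_eax(0xDEADBEEFCAFEF00D))
print(len(XOR_EAX_EAX), len(mov_eax_imm32(0)))
```

Loading zero with `mov` would cost the same five bytes as loading one, because the immediate always occupies four bytes regardless of its value.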


## Multiply by constant

In [8]:
%%python -m gcc
long x3(long x) {
    return x * 3;
}

x3:
	lea	rax, [rdi+rdi*2]
	ret


In [9]:
%%python -m gcc
long x17_plus_23(long x) {
    return x * 17 + 23;
}

x17_plus_23:
	mov	rax, rdi
	sal	rax, 4
	lea	rax, [rax+23+rdi]
	ret
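Neither listing contains a multiply instruction: the compiler builds the product out of a shift and the address arithmetic performed by `lea`. The Python sketch below replays both sequences one instruction at a time. The `lea` helper is a simplified model written for this example (base + index * scale + displacement, wrapped to 64 bits), not a description of every form the instruction supports.

```python
MASK64 = (1 << 64) - 1


def lea(base, index=0, scale=1, disp=0):
    """Model lea: compute base + index*scale + disp, wrapped to 64 bits."""
    return (base + index * scale + disp) & MASK64


def x3(rdi):
    return lea(rdi, index=rdi, scale=2)      # lea rax, [rdi+rdi*2]


def x17_plus_23(rdi):
    rax = rdi                                # mov rax, rdi
    rax = (rax << 4) & MASK64                # sal rax, 4   (rax = x * 16)
    return lea(rax, index=rdi, disp=23)      # lea rax, [rax+23+rdi]


for x in (0, 1, 7, 1000, (-5) & MASK64):
    assert x3(x) == (x * 3) & MASK64
    assert x17_plus_23(x) == (x * 17 + 23) & MASK64
print(x3(7), x17_plus_23(7))
```

Reading it back: `x * 3` is `x + x * 2`, and `x * 17 + 23` is `(x << 4) + x + 23`.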


We'll talk more about addressing later, but for now we'll mention that Intel assembler allows us two ways to express the "plus some constant" addressing mode. Some versions of gcc present it with the constant in front of the brackets, as in `23[rax+rdi]`. However it can also be expressed entirely within the square brackets, as in the `x17_plus_23` listing above: `[rax+23+rdi]`.

TODO: double check this with a build

# Addressing modes

In [10]:
%%python -m gcc
struct s { int a; int b; };
void ss(struct s * ptr)
{
    ptr[5].b = 42;
}

void sn(struct s *ptr, unsigned long offset)
{
    ptr[offset].b = 42;
}

ss:
	mov	DWORD PTR [rdi+44], 42
	ret
sn:
	mov	DWORD PTR [rdi+4+rsi*8], 42
	ret
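The constants in the two stores above come straight from the layout of `struct s`. As a rough check, the Python sketch below mirrors the struct with `ctypes` and recomputes the offsets. The class `S` and the helpers are names made up for this example, and it assumes that `int` is four bytes, as it is on x86-64 Linux.

```python
import ctypes


class S(ctypes.Structure):
    """Mirror of `struct s { int a; int b; };` (assumes 4-byte int)."""
    _fields_ = [("a", ctypes.c_int), ("b", ctypes.c_int)]


def effective_address(base, index=0, scale=1, disp=0):
    """Model an x86 memory operand: base + index*scale + disp."""
    return base + index * scale + disp


def offset_of_b(element):
    """Byte offset of ptr[element].b relative to ptr."""
    return effective_address(
        0, index=element, scale=ctypes.sizeof(S), disp=S.b.offset
    )


print(ctypes.sizeof(S), S.b.offset)   # scale and displacement used by sn
print(offset_of_b(5))                 # constant folded into ss
```

In `ss` the index is known at compile time, so the whole offset collapses into a single displacement. In `sn` the index lives in `rsi`, so the element size becomes the scale factor and the offset of `b` becomes the displacement.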


# Appendix A - gcc.py

In [11]:
%%file gcc.py
import subprocess
import sys

# This is correct for Debian (and works nicely on my Arm laptop).
# Feel free to replace it with plain old "gcc" if you have an x86-64
# computer!
#CROSS_COMPILE = "x86_64-linux-gnu-"
CROSS_COMPILE = ""
CC = CROSS_COMPILE + "gcc"
OBJDUMP = CROSS_COMPILE + "objdump"
SYNTAX = "-masm=intel"

def gcc(program, parameters=[], no_filter=False):
    def filter(ln):
        ln = ln.strip()
        return (
            # Directives we don't care about
            ln.startswith(".arch") or
            ln.startswith(".file") or
            ln.startswith(".text") or
            ln.startswith(".align") or
            ln.startswith(".p2align") or
            ln.startswith(".global") or
            ln.startswith(".globl") or
            ln.startswith(".type") or
            ln.startswith(".intel_syntax") or
            ln.startswith(".size") or
            ln.startswith(".ident") or
            ln.startswith(".section") or
            # "Wildcard" for .cfi_startproc and .cfi_endproc
            ln.startswith(".cfi_") or
            # Labels that are not used as jump targets
            ln.startswith(".LFB") or
            ln.startswith(".LFE")
        )
        
    try:
        result = subprocess.run(
            [CC, SYNTAX, "-O2", "-xc"] +
                parameters +
                ["-S", "-", "-o", "-"],
            input=program,
            capture_output=True,
            text=True,
            check=True,
        )

        asm = [ln for ln in result.stdout.splitlines()
                        if no_filter or not filter(ln)]
        print("\n".join(asm))

    except subprocess.CalledProcessError as e:
        print(e.stderr)
        # Don't print e.stdout, it contains the partially assembled
        # file rather than any useful error output.

def objdump(program, parameters=[], no_filter=False):
    def filter(ln):
        ln = ln.strip()
        return (
            ln == "" or
            ln.startswith("asm.o") or
            ln.startswith("Disassembly")
        )

    try:
        result = subprocess.run(
            [CC, SYNTAX, "-O2", "-xc"] + parameters + ["-c", "-", "-o", "asm.o"],
            input=program,
            capture_output=True,
            text=True,
            check=True,
        )

        result = subprocess.run(
            [OBJDUMP, "-Mintel", "-d", "asm.o"],
            capture_output=True,
            text=True,
            check=True,
        )

        asm = [ln for ln in result.stdout.splitlines() if no_filter or not filter(ln)]
        print("\n".join(asm))

    except subprocess.CalledProcessError as e:
        print(e.stderr)
        print(e.stdout)

if __name__ == "__main__":
    program = sys.stdin.read()
    if len(sys.argv) > 1 and sys.argv[1] == "objdump":
        objdump(program, sys.argv[2:])
    else:
        gcc(program, sys.argv[1:])

Overwriting gcc.py
