# ELF Exploration

The traditional input for a linker is `.o` files, which are in ELF format. See [ASSEMBLER.md](../docs/ASSEMBLER.md) for more details, including links to more information about ELF files.

Fully parsing ELF files was out of scope for this project, but this notebook contains some exploration I did into them.

In [34]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [35]:
# We use the pyelftools library to parse ELF files

# Import dependencies
try:
	from elftools.elf.elffile import ELFFile, SymbolTableSection
	from hexdump import hexdump
	# We need to import this directly so that if it's missing we trigger the except handler, because
	# rv32i will throw a different error
	from bitstring import BitArray
	from rv32i import bits_to_line
except ModuleNotFoundError:
	# Install if missing, then re-run this cell so the imports above succeed
	%pip install pyelftools hexdump bitstring

We need a RISC-V ELF file to play with. To get one, run:

```bash
$ ./riscv_gcc_docker.sh -march=rv32i -mabi=ilp32 -c -o elf_example.o ./csrc/elf_example.c
```

To compare this with the equivalent text-based assembly, run:

```bash
$ make asm/compiled/elf_example.s
```

In [36]:
elf = ELFFile.load_from_path('../elf_example.o')

# Some basic sanity checks
assert elf.elfclass == 32, "we only support 32 bit code!"
assert elf.little_endian, "we only support little endian"
assert elf['e_machine'] == 'EM_RISCV', "we only support RISC-V"
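
For reference, the three checks above correspond to fixed fields at the start of the ELF header. The following is a minimal sketch of reading them with only the standard library, assuming the standard ELF header layout. `check_header` is a hypothetical helper (not part of pyelftools), and the example header is hand-built rather than read from `elf_example.o`.

```python
import struct

EM_RISCV = 243  # e_machine value assigned to RISC-V

def check_header(header: bytes) -> dict:
	"""Pull the class, endianness and machine out of the first bytes of an ELF file."""
	assert header[:4] == b'\x7fELF', "not an ELF file"
	ei_class = header[4]  # EI_CLASS: 1 = 32 bit, 2 = 64 bit
	ei_data = header[5]   # EI_DATA: 1 = little endian, 2 = big endian
	byte_order = '<' if ei_data == 1 else '>'
	# e_type and e_machine are two 16-bit fields right after the 16-byte e_ident
	e_type, e_machine = struct.unpack_from(byte_order + 'HH', header, 16)
	return {
		'elfclass': 32 if ei_class == 1 else 64,
		'little_endian': ei_data == 1,
		'is_riscv': e_machine == EM_RISCV,
	}

# Hand-built example: 16-byte e_ident, then e_type = 1 (ET_REL) and e_machine = 243
example = b'\x7fELF' + bytes([1, 1, 1]) + bytes(9) + struct.pack('<HH', 1, EM_RISCV)
print(check_header(example))
```

pyelftools does this parsing itself, so the sketch is only meant to show where `elfclass`, `little_endian` and `e_machine` come from.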

## Sections

ELF files are split into sections, each of which has a different name/type/purpose. [This page][sections] has some information about what some of them are. I think some of them are non-standard. [This page][riscv-elf-spec] appears to be the RISC-V ELF spec (or spec modifications), and might be helpful.

As of this writing, here's what was in the ELF file, and what I know about each section. Run the cell below to print the names of all the sections in the current ELF file.

Sections (in order):
- `[NULL]` (0 bytes): The mandatory empty first entry of the section header table (index 0). It can be ignored.
- `.text` (212 bytes): Has executable code in it, stored as raw machine code. For rv32i that means one 32-bit little-endian word per instruction.
	- _Very_ weirdly, if you try to disassemble it with a **1 byte** offset (i.e. discard the first and last three bytes), it disassembles as mostly-valid (but totally nonsense) assembly.
- `.rela.text` (336 bytes): Not code. This holds the relocation entries (with explicit addends) that tell the linker which spots in `.text` to patch once final addresses are known.
- `.data` (0 bytes): Read-write, non-executable data. Contains initialized static or global variables.
- `.bss` (4 bytes): "Read-write section containing uninitialized data". It takes up no space in the file (only its size is recorded), which is why it can have a non-zero size but no content.
- `.sdata` (2 bytes): "This section holds initialized small data that contribute to the program memory image." ([Source][.sdata])
- `.comment` (27 bytes): Pretty sure this is a comment that can be ignored. I've only ever seen one that has information about the GCC version.
- `.Pulp_Chip.Info` (78 bytes): PULP appears to be a specific type of chip, so this is probably safe to ignore. Google has very few results for this section. [Source][pulp]
- `.symtab` (256 bytes): Contains the symbol table: one entry per symbol, recording its name (as an offset into `.strtab`), value (for a function in a `.o` file, its offset within its section), size, type, binding, and section index.
- `.strtab` (106 bytes): Contains the string table: the NUL-terminated symbol names that `.symtab` entries point into.
- `.shstrtab` (81 bytes): The section header string table, which holds the names of the sections in the ELF file.

The above linked page also mentions:
- `.rodata`: "read-only section containing const variables"
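
On the `.rela.text` entry above: sections named `.rela.*` hold relocation records rather than instructions. As a rough sketch, assuming the standard ELF32 `Elf32_Rela` layout (three 32-bit fields), a record can be unpacked by hand like this. `parse_rela` and the example bytes are made up for illustration; only the two relocation type numbers come from the RISC-V psABI.

```python
import struct

# Two relocation types from the RISC-V psABI (not an exhaustive list)
RELOC_NAMES = {18: 'R_RISCV_CALL', 51: 'R_RISCV_RELAX'}

def parse_rela(data: bytes, little_endian: bool = True):
	"""Yield (offset, symbol index, type, addend) for each 12-byte Elf32_Rela record."""
	fmt = ('<' if little_endian else '>') + 'IIi'  # r_offset, r_info, r_addend
	for r_offset, r_info, r_addend in struct.iter_unpack(fmt, data):
		# For ELF32, r_info packs the symbol index (upper 24 bits) and the type (low 8 bits)
		yield r_offset, r_info >> 8, r_info & 0xFF, r_addend

# Made-up example: two records that both patch offset 0x20 of `.text`
example = struct.pack('<IIi', 0x20, (9 << 8) | 18, 0) + struct.pack('<IIi', 0x20, 51, 0)
for offset, sym, rtype, addend in parse_rela(example):
	print(f"{offset:#06x} sym={sym} type={RELOC_NAMES.get(rtype, rtype)} addend={addend}")
```

pyelftools should expose the same fields through `iter_relocations()` on a relocation section, so this is only useful for understanding the raw bytes.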

[sections]: https://michaeljclark.github.io/asm.html
[pulp]: https://github.com/chrta/zephyr-sim3u/blob/master/soc/riscv32/openisa_rv32m1/linker.ld
[.sdata]: https://refspecs.linuxfoundation.org/LSB_3.1.1/LSB-Core-PPC64/LSB-Core-PPC64/specialsections.html
[riscv-elf-spec]: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-elf.adoc#gabi

The following cell will print the contents of each section:

In [41]:
INDENT = '   '
DISASSEMBLER_BYTE_OFFSET = 0
ATTEMPT_TO_DISASSEMBLE = True

print("Sections in the ELF file (in order):")
for section in elf.iter_sections():
	print(f"- `{'[NULL]' if section.is_null() else section.name}` ({section.data_size} bytes):")

	if section.is_null():
		continue

	data = section.data()

	try:
		# Conceivably should be utf-8, but if it's not valid ascii then it's probably binary
		strrep = data.decode('ascii')
	except UnicodeDecodeError:
		# Binary data: hexdump it, unless we manage to disassemble it below
		strrep = hexdump(data, result='return')

		# Only `.text` holds code, and rv32i code is always a whole number of 32-bit words
		if ATTEMPT_TO_DISASSEMBLE and section.name == '.text' and section.data_size % 4 == 0:
			print(f"compressed={section.compressed}, data_alignment={section.data_alignment}")

			strrep = ""
			for i in range(DISASSEMBLER_BYTE_OFFSET, len(data), 4):
				cur_bytes = data[i:i+4]
				if elf.little_endian:
					# Reverse just this word (not the whole section) so the most
					# significant byte comes first
					cur_bytes = cur_bytes[::-1]
				try:
					bits = BitArray(bytes=cur_bytes, length=32)
					strrep += bits_to_line(bits) + '\n'
				except Exception as err:
					strrep += f"Failed to decode {cur_bytes.hex()}: {err}" + '\n'

	strrep = f"```\n{strrep}\n```"

	print(INDENT + ('\n' + INDENT).join(strrep.split('\n')))

Sections in the ELF file (in order):
- `[NULL]` (0 bytes):
- `.text` (64 bytes):
compressed=0, data_alignment=4
   ```
   addi sp, sp, -32
   sw ra, 28(sp)
   sw fp, 24(sp)
   addi fp, sp, 32
   sw a0, -20(fp)
   lw a5, -20(fp)
   addi a1, zero, 3
   addi a0, a5, 0
   auipc ra, -3
   jalr ra, ra, 0
   addi a5, a0, 0
   addi a0, a5, 0
   lw ra, 28(sp)
   lw fp, 24(sp)
   addi sp, sp, 32
   jalr zero, ra, 0
   
   ```
- `.rela.text` (24 bytes):
   ```
                 3       
   ```
- `.data` (0 bytes):
   ```
   
   ```
- `.bss` (0 bytes):
   ```
   
   ```
- `.comment` (19 bytes):
   ```
    GCC: (PULP) 9.2.0 
   ```
- `.riscv.attributes` (28 bytes):
   ```
   A   riscv    rv32i2p0 
   ```
- `.symtab` (144 bytes):
   ```
   00000000: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
   00000010: 01 00 00 00 00 00 00 00  00 00 00 00 04 00 F1 FF  ................
   00000020: 00 00 00 00 00 00 00 00  00 00 00 00 03 00 01 00  .......

# Attempt at Disassembling

In [38]:
symtab: SymbolTableSection = elf.get_section_by_name('.symtab')

for symbol in sorted(symtab.iter_symbols(), key=lambda sym: sym.entry['st_value']):
	if symbol.entry['st_size'] == 0:
		continue

	print(str(symbol.name).ljust(20), symbol.entry)

main                 Container({'st_name': 24, 'st_value': 0, 'st_size': 64, 'st_info': Container({'bind': 'STB_GLOBAL', 'type': 'STT_FUNC'}), 'st_other': Container({'local': 0, 'visibility': 'STV_DEFAULT'}), 'st_shndx': 1})
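
To connect this to the raw `.symtab` hexdump printed earlier, here is a small sketch that unpacks one 16-byte entry by hand, assuming the standard `Elf32_Sym` layout. `parse_sym` is a hypothetical helper, and the example bytes are the second row of that hexdump.

```python
import struct

# Subset of the standard ELF symbol bindings and types
BINDS = {0: 'STB_LOCAL', 1: 'STB_GLOBAL', 2: 'STB_WEAK'}
TYPES = {0: 'STT_NOTYPE', 1: 'STT_OBJECT', 2: 'STT_FUNC', 3: 'STT_SECTION', 4: 'STT_FILE'}

def parse_sym(entry: bytes, little_endian: bool = True) -> dict:
	"""Unpack one 16-byte Elf32_Sym entry into named fields."""
	fmt = ('<' if little_endian else '>') + 'IIIBBH'
	st_name, st_value, st_size, st_info, st_other, st_shndx = struct.unpack(fmt, entry)
	return {
		'st_name': st_name,    # offset of the name inside `.strtab`
		'st_value': st_value,  # in a `.o` file: offset within the symbol's section
		'st_size': st_size,
		'bind': BINDS.get(st_info >> 4, st_info >> 4),    # upper 4 bits of st_info
		'type': TYPES.get(st_info & 0xF, st_info & 0xF),  # lower 4 bits of st_info
		'st_shndx': st_shndx,  # index of the section the symbol lives in
	}

# Second row of the `.symtab` hexdump shown earlier
row = bytes.fromhex('01000000 00000000 00000000 0400 F1FF')
print(parse_sym(row))
```

That row is a local `STT_FILE` symbol whose section index `0xFFF1` is the reserved `SHN_ABS` value. Its `st_size` is 0, which is why the loop above skips it.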


In [42]:
section = elf.get_section_by_name('.text')
data = section.data()

assert section.data_size % 4 == 0, "if it's not a multiple of 32 bits, it's not code"

# Decode one 32-bit instruction at a time, reversing each little-endian word so the
# most significant byte comes first
for i in range(0, len(data), 4):
	print(bits_to_line(BitArray(bytes=data[i:i+4][::-1], length=32)))

addi sp, sp, -32
sw ra, 28(sp)
sw fp, 24(sp)
addi fp, sp, 32
sw a0, -20(fp)
lw a5, -20(fp)
addi a1, zero, 3
addi a0, a5, 0
auipc ra, -3
jalr ra, ra, 0
addi a5, a0, 0
addi a0, a5, 0
lw ra, 28(sp)
lw fp, 24(sp)
addi sp, sp, 32
jalr zero, ra, 0
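
To see why each 4-byte group is reversed before decoding, here is a minimal, self-contained sketch that decodes a single I-type instruction by hand. It does not use `bits_to_line`; `decode_i_type` is a hypothetical helper, and the four example bytes are the standard encoding of `addi sp, sp, -32` (the first instruction above), written out by hand rather than copied from the file.

```python
ABI_NAMES = ['zero', 'ra', 'sp', 'gp', 'tp', 't0', 't1', 't2', 'fp', 's1'] + \
	[f'a{i}' for i in range(8)] + [f's{i}' for i in range(2, 12)] + [f't{i}' for i in range(3, 7)]

def decode_i_type(raw: bytes) -> str:
	"""Decode a little-endian RV32I `addi`; other instructions are out of scope here."""
	word = int.from_bytes(raw, 'little')  # same effect as reversing the bytes first
	opcode = word & 0x7F
	rd = (word >> 7) & 0x1F
	funct3 = (word >> 12) & 0x7
	rs1 = (word >> 15) & 0x1F
	imm = word >> 20
	if imm >= 0x800:  # sign-extend the 12-bit immediate
		imm -= 0x1000
	assert (opcode, funct3) == (0x13, 0), "this sketch only handles addi"
	return f"addi {ABI_NAMES[rd]}, {ABI_NAMES[rs1]}, {imm}"

print(decode_i_type(bytes([0x13, 0x01, 0x01, 0xFE])))  # prints "addi sp, sp, -32"
```

The opcode lives in the lowest 7 bits of the word, which on a little-endian target is the first byte in the file. Reading the bytes in file order as if they were big endian would put the opcode at the wrong end of the word.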


In [40]:
text_section = elf.get_section_by_name('.text')
text_data = text_section.data()
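# Not every object file has a `.sdata` section, so fall back to empty data when it's missing
sdata_section = elf.get_section_by_name('.sdata')
sdata = sdata_section.data() if sdata_section is not None else b''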

for symbol in sorted(symtab.iter_symbols(), key=lambda sym: sym.entry['st_value']):
	if symbol.entry['st_size'] == 0:
		continue

	start_addr = symbol.entry['st_value']
	end_addr = start_addr + symbol.entry['st_size']

	sym_type = symbol.entry['st_info']['type']

	print(str(symbol.name).ljust(20), start_addr, end_addr, sym_type, symbol.entry)

	if sym_type == 'STT_FUNC':
		for i in range(start_addr, end_addr, 4):
			print(bits_to_line(BitArray(bytes=text_data[i+0:i+4][::-1], length=32)))
	elif sym_type == 'STT_OBJECT':
		print(sdata[start_addr:end_addr][::-1], int.from_bytes(sdata[start_addr:end_addr], 'little' if elf.little_endian else 'big'))
	else:
		print(f"Unknown symbol type: {sym_type}")
	print('\n\n')

main                 0 64 STT_FUNC Container({'st_name': 24, 'st_value': 0, 'st_size': 64, 'st_info': Container({'bind': 'STB_GLOBAL', 'type': 'STT_FUNC'}), 'st_other': Container({'local': 0, 'visibility': 'STV_DEFAULT'}), 'st_shndx': 1})
addi sp, sp, -32
sw ra, 28(sp)
sw fp, 24(sp)
addi fp, sp, 32
sw a0, -20(fp)
lw a5, -20(fp)
addi a1, zero, 3
addi a0, a5, 0
auipc ra, -3
jalr ra, ra, 0
addi a5, a0, 0
addi a0, a5, 0
lw ra, 28(sp)
lw fp, 24(sp)
addi sp, sp, 32
jalr zero, ra, 0
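
The `st_name: 24` in the entry above is not the name itself but a byte offset into `.strtab`, where names are stored back to back as NUL-terminated strings. A minimal sketch of that lookup, using a made-up string table (so the offsets differ from the real file) and a hypothetical `read_name` helper:

```python
def read_name(strtab: bytes, offset: int) -> str:
	"""Return the NUL-terminated string that starts at `offset` in a string table."""
	end = strtab.index(b'\x00', offset)
	return strtab[offset:end].decode('ascii')

# Made-up string table; the first byte of a string table is always NUL
strtab = b'\x00elf_example.c\x00main\x00'
print(read_name(strtab, 1))
print(read_name(strtab, 15))
```

pyelftools performs this lookup itself, which is why `symbol.name` is already a string in the cells above.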



