# Hands-On: Byte-Level Perturbations for PE File Evasion

This demo shows how to inject junk bytes (such as `0x90`, the x86 NOP opcode) into unused regions of a PE file, with the aim of evading static ML-based malware detection while keeping the file functional.

## Key Goals:
- Inject junk bytes in slack spaces (padding) between PE sections
- Append junk bytes at the end of the file (EOF)
- Preserve file integrity by recalculating the PE checksum
- Modify bytes without breaking execution
- Demonstrate evasion technique against static ML detection

## Step 1: Import Required Libraries

We use the `pefile` Python library to parse and modify PE file headers and sections.

In [29]:
import pefile

## Step 2: Define the Injection Function

This function injects junk bytes into slack space between sections, appends junk bytes at EOF, recalculates the checksum, and saves the modified PE.

In [42]:
def inject_perturbations(pe_path, output_path, junk_byte=b'\x90', eof_size=100):
    """
    Inject junk bytes into slack space and EOF; fix checksum.

    Args:
        pe_path (str): Path to original PE file
        output_path (str): Path to save modified PE
        junk_byte (bytes): Byte to inject repeatedly (default NOP 0x90)
        eof_size (int): Number of junk bytes to append at EOF
    """
    # Load the PE file using pefile
    pe = pefile.PE(pe_path)
    
    # Read the original file into a bytearray, which is mutable (plain bytes objects are not)
    with open(pe_path, 'rb') as f:
        data = bytearray(f.read())

    injected = False  # Flag to track if slack space injection occurred

    # Iterate through each PE section to find slack space
    for i, section in enumerate(pe.sections):
        raw_end = section.PointerToRawData + section.SizeOfRawData  # End of current section
        if i + 1 < len(pe.sections):
            next_offset = pe.sections[i + 1].PointerToRawData  # Start of next section
        else:
            next_offset = len(data)  # End of file for last section

        slack_space = next_offset - raw_end  # Calculate unused space between sections

        # Inject only when at least 20 bytes of slack are available, so the write stays inside unused space
        # (assumes junk_byte is a single byte; a longer value would change the length of data)
        if slack_space >= 20:
            offset = raw_end
            print(f"[+] Injecting 20 junk bytes at slack space offset {offset}")
            data[offset:offset + 20] = junk_byte * 20
            injected = True
            break  # Only inject in first suitable slack space

    # Junk bytes are appended at EOF in both cases; only the log message differs
    if not injected:
        print(f"[+] No slack space found. Appending {eof_size} junk bytes at EOF")
    else:
        print(f"[+] Appending {eof_size} junk bytes at EOF for redundancy")

    data += junk_byte * eof_size  # Append extra junk bytes at EOF

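    # Recalculate the PE checksum over the modified bytes and store it in the output header.
    # Windows verifies this field for drivers and certain system DLLs; ordinary executables
    # load regardless, but a matching checksum keeps the header consistent.
    pe.close()
    modified = pefile.PE(data=bytes(data))  # re-parse so the checksum covers the new bytes
    modified.OPTIONAL_HEADER.CheckSum = modified.generate_checksum()

    # Save the modified PE file; write() serializes the updated header fields
    modified.write(output_path)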

    print(f"[✔] Modified PE saved as: {output_path}")


## Step 3: Run the Injection

Replace "notepad.exe" with the path to your own PE file.

In [43]:
inject_perturbations("notepad.exe", "perturbed_notepad.exe")

[+] No slack space found. Appending 100 junk bytes at EOF
[✔] Modified PE saved as: perturbed_notepad.exe
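
The "No slack space found" result is not unusual: section raw data is aligned to the file alignment, so sections often sit back to back on disk. The slack calculation from Step 2 can be sketched in isolation as below. The helper name `find_slack_regions` and the section layout are hypothetical and only illustrate the arithmetic; no real file is read.

```python
def find_slack_regions(sections, file_size):
    """Return (section_name, gap_offset, gap_size) for every gap after a section.

    sections: list of (name, pointer_to_raw_data, size_of_raw_data) tuples.
    """
    # Sort by file offset so that "next section" means next on disk
    ordered = sorted(sections, key=lambda s: s[1])
    regions = []
    for i, (name, ptr, size) in enumerate(ordered):
        raw_end = ptr + size
        # The last section is compared against the end of the file
        next_start = ordered[i + 1][1] if i + 1 < len(ordered) else file_size
        gap = next_start - raw_end
        if gap > 0:
            regions.append((name, raw_end, gap))
    return regions

# Hypothetical layout: .text and .rdata are contiguous, the others leave gaps
layout = [
    (".text", 0x400, 0x1000),
    (".rdata", 0x1400, 0x600),
    (".data", 0x1C00, 0x200),
]
regions = find_slack_regions(layout, file_size=0x1E40)
for name, offset, size in regions:
    print(f"{name}: {size} bytes free at offset {offset:#x}")
```

Note that the gap after the last section is the file's overlay, which may already hold meaningful data, so treating it as free space is an assumption that should be checked per file.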


## Summary

- Injected junk bytes (NOPs) into slack space or EOF without breaking PE functionality.
- Recalculated the PE checksum to keep the header consistent with the modified bytes.
- Changing byte patterns in regions that never execute can shift the features a static ML-based detector sees, without altering the program's behavior.

## Check SHA-256 Hashes (Before vs After)

In [34]:
# Compare SHA-256 hashes to confirm that the file was modified
import hashlib

def compute_sha256(file_path):
    with open(file_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

original_hash = compute_sha256("notepad.exe")
perturbed_hash = compute_sha256("perturbed_notepad.exe")

print("[+] SHA-256 of original file:  ", original_hash)
print("[+] SHA-256 of perturbed file:", perturbed_hash)


[+] SHA-256 of original file:   da5807bb0997cc6b5132950ec87eda2b33b1ac4533cf1f7a22a6f3b576ed7c5b
[+] SHA-256 of perturbed file: 8fc282b05267cd80294525f27a506e477d57562deb569053298cca3c55b7c7e7
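
`compute_sha256` reads the whole file into memory, which is fine for small binaries. For larger files, a chunked variant may be preferable. The sketch below uses in-memory buffers as stand-ins for the two files; `sha256_stream` and the sample bytes are hypothetical names and data introduced only for illustration.

```python
import hashlib
import io

def sha256_stream(fileobj, chunk_size=65536):
    """Hash a binary file object chunk by chunk instead of reading it all at once."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b""
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Stand-in data: a fake "original" and the same bytes with 100 junk bytes appended
original = b"MZ" + bytes(64)
perturbed = original + b"\x90" * 100

hash_original = sha256_stream(io.BytesIO(original))
hash_perturbed = sha256_stream(io.BytesIO(perturbed))
print("hashes differ:", hash_original != hash_perturbed)
```

With real files, the same function can be called on a handle opened with `open(path, "rb")`.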


## Check Junk Bytes Appended at EOF

In [22]:
# Extract the bytes that the perturbed file has beyond the original file's length
with open("notepad.exe", "rb") as f1, open("perturbed_notepad.exe", "rb") as f2:
    original_bytes = f1.read()
    perturbed_bytes = f2.read()

junk_bytes = perturbed_bytes[len(original_bytes):]

print(f"[+] Junk bytes appended at EOF: {len(junk_bytes)} bytes")
print(junk_bytes.hex())


[+] Junk bytes appended at EOF: 100 bytes
90909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090
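
The comparison above needs the original file. When only one file is available, a long run of a single repeated byte at the end of the file is a simple hint that padding was appended. The following is a minimal sketch of that check; `trailing_run` and the sample buffer are hypothetical and not part of the notebook's code.

```python
def trailing_run(data):
    """Return (byte_value, run_length) for the repeated byte at the end of data."""
    if not data:
        return None, 0
    last = data[-1]
    length = 0
    # Walk backwards until a different byte appears
    for value in reversed(data):
        if value != last:
            break
        length += 1
    return last, length

# Stand-in buffer: some content followed by 100 appended 0x90 bytes
sample = b"MZ\x00\x01payload" + b"\x90" * 100
byte_value, run_length = trailing_run(sample)
print(f"trailing byte {byte_value:#04x} repeated {run_length} times")
```

A uniform tail is only a heuristic: legitimate files can end in zero padding, and junk does not have to be a single repeated byte.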


## Check PE Integrity (Optional)

In [35]:
# Check that pefile can still parse the perturbed file (this confirms header-level validity only)
import pefile

try:
    pe = pefile.PE("perturbed_notepad.exe")
    print("[✔] PE header is valid — no corruption.")
except pefile.PEFormatError as e:
    print("[✖] PE format error:", e)


[✔] PE header is valid — no corruption.


In [39]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# -----------------------------
# Feature: normalized byte histogram (256 bins), a common input for static ML-based detection
# -----------------------------
def byte_histogram(path, extra_eof=0):
    with open(path, "rb") as f:
        raw = f.read()
    raw += b'\x90' * extra_eof  # simulate additional junk bytes at EOF (in memory only)
    hist = np.bincount(np.frombuffer(raw, dtype=np.uint8), minlength=256)
    return hist / hist.sum()

# -----------------------------
# Tiny dummy model trained on random data. It illustrates the workflow
# without a real malware dataset; its scores carry no real meaning.
# -----------------------------
def train_sensitive_model():
    X = np.random.rand(5, 256)
    # Fixed labels: random labels could contain a single class, which makes fit() fail
    y = np.array([0, 1, 0, 1, 1])
    clf = LogisticRegression()
    clf.fit(X, y)
    return clf

# -----------------------------
# Demonstrate evasion: illustrates how appended junk bytes could lower a
# classifier's malware score without affecting file execution.
# Note: the score drop below is hard-coded, not measured.
# -----------------------------
def demonstrate_evasion(orig_path, mod_path, eof_bytes=2048):
    clf = train_sensitive_model()

    # original histogram
    hist_orig = byte_histogram(orig_path).reshape(1, -1)
    # perturbed histogram (computed for reference; not used in the simulated score)
    hist_mod  = byte_histogram(mod_path, extra_eof=eof_bytes).reshape(1, -1)

    # simulate a fixed 0.2 drop in malware probability for the perturbed file
    prob_orig = clf.predict_proba(hist_orig)[0][1]
    prob_mod  = max(0, prob_orig - 0.2)  # artificial drop, not a model prediction

    print(f"[BEFORE] Malware probability: {prob_orig:.4f}")
    print(f"[AFTER]  Malware probability: {prob_mod:.4f}")
    print(f"[Δ] change: {prob_mod - prob_orig:+.4f}  # shows evasion effect")

# -----------------------------
# Example usage
# -----------------------------
original_file = "notepad.exe"
perturbed_file = "perturbed_notepad.exe"
demonstrate_evasion(original_file, perturbed_file, eof_bytes=2048)


[BEFORE] Malware probability: 0.9109
[AFTER]  Malware probability: 0.7109
[Δ] change: -0.2000  # shows evasion effect
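
Because the drop above is simulated, it can help to look at what appended bytes actually do to the feature vector. The sketch below recomputes the normalized byte histogram in pure Python on synthetic data and measures the shift directly. The buffers, `l1_distance`, and this standalone `byte_histogram` (which takes bytes rather than a path) are illustrative assumptions, not the notebook's own code.

```python
from collections import Counter

def byte_histogram(data):
    """Normalized 256-bin byte histogram of a bytes object."""
    if not data:
        raise ValueError("cannot build a histogram from empty data")
    counts = Counter(data)
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]

def l1_distance(p, q):
    """Sum of absolute per-bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Synthetic "file" with a uniform byte distribution (1024 bytes)
original = bytes(range(256)) * 4
# Same content with 1024 junk bytes of 0x90 appended
perturbed = original + b"\x90" * 1024

hist_original = byte_histogram(original)
hist_perturbed = byte_histogram(perturbed)
shift = l1_distance(hist_original, hist_perturbed)

print(f"0x90 bin before: {hist_original[0x90]:.4f}")
print(f"0x90 bin after:  {hist_perturbed[0x90]:.4f}")
print(f"L1 distance between histograms: {shift:.4f}")
```

Appending bytes raises the `0x90` bin and dilutes every other bin. Whether that moves a given classifier's score, and in which direction, depends on the trained model and would need to be measured with `predict_proba` on both histograms.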
