# Week 06 – Binary Analysis & Symbolic Execution  
## Static Malware Analysis with Python (Windows PE Files)

This Jupyter notebook implements the **Week 06 lab** for *Networks and Systems Security*.

The focus is on **safe static analysis** of a benign Windows executable (e.g. `Procmon.exe` from Microsoft Sysinternals) using Python-based tooling.



## Aims of the Seminar

This laboratory exercise introduces the core concepts and practices of **static malware analysis** using Python.

Rather than executing suspicious samples, we focus on safe, static techniques applied to **benign Windows executables** from trusted sources. The workflow mirrors early-stage malware triage:

- Computing cryptographic **hashes** (MD5, SHA1, SHA256)
- Extracting and reviewing **strings**
- Inspecting **PE headers** and imported functions
- Applying simple **YARA rules**
- Combining these steps into a **static triage pipeline**




In [2]:
# Install required libraries (run this once per environment)
%pip install --quiet pefile yara-python


Note: you may need to restart the kernel to use updated packages.


## Configure Your Sample Path

Set the path to your benign Windows executable (e.g. `Procmon.exe`).  
**Edit the string below** to match your real file location.


In [12]:
import os

# TODO: Change this path to the actual location of your Procmon.exe (or other benign PE)
sample = "/Users/dissept/Downloads/ProcessMonitor/Procmon.exe"

if not os.path.isfile(sample):
    print("[!] WARNING: The sample path does not exist. Please update the 'sample' variable above.")
else:
    print("[+] Sample path looks valid:", sample)


[+] Sample path looks valid: /Users/dissept/Downloads/ProcessMonitor/Procmon.exe
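
The path check above only confirms that a file exists, not that it is a PE. As a minimal sketch, the file's magic values can be verified before further analysis: a PE starts with `MZ`, and the 4-byte little-endian field at offset `0x3C` (`e_lfanew`) points to the `PE\0\0` signature. The helper name `looks_like_pe` and the synthetic header are illustrative assumptions, not part of the lab code.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Return True if the buffer has an MZ header pointing at a PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: file offset of the PE signature, stored at 0x3C
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"

# Synthetic 68-byte buffer: MZ, padding, e_lfanew = 0x40, then the PE signature
fake_pe = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40) + b"PE\x00\x00"
print(looks_like_pe(fake_pe))
print(looks_like_pe(b"plain text, not an executable"))
```

With a real sample, reading the first few kilobytes of the file (`open(sample, "rb").read(4096)`) is usually enough for this check.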


## Hash Calculation (IOCs)

In malware analysis, **cryptographic hashes** are one of the most fundamental **Indicators of Compromise (IOCs)**.

They act as practically unique identifiers for a file, enabling:

- Threat intelligence sharing  
- Duplicate sample detection  
- Quick reputation checks (e.g. VirusTotal)  
- SOC correlation and automated triage

We will compute the following:

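- `MD5` (legacy; practical collisions are known, so use it for lookups only)  
- `SHA1` (legacy; a practical collision has also been demonstrated)  
- `SHA256` (industry standard; better collision resistance)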
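
To illustrate the duplicate-detection use case listed above, here is a small sketch that groups samples by SHA-256 digest. The file names and byte contents are hypothetical in-memory stand-ins, not real samples.

```python
import hashlib
from collections import defaultdict

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical samples: two share identical content under different names
samples = {
    "invoice.exe": b"MZ\x90\x00 payload A",
    "invoice_copy.exe": b"MZ\x90\x00 payload A",
    "setup.exe": b"MZ\x90\x00 payload B",
}

# Group file names by digest; any group with more than one name is a duplicate set
groups = defaultdict(list)
for name, content in samples.items():
    groups[sha256_hex(content)].append(name)

duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
```

File names are irrelevant to the digest, which is why hashes rather than names are shared as IOCs.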


In [13]:
import hashlib

def compute_hash(path: str, algorithm: str) -> str:
    """Compute the hash of a file using the specified algorithm."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

if os.path.isfile(sample):
    print("Sample:", sample)
    print("MD5:   ", compute_hash(sample, "md5"))
    print("SHA1:  ", compute_hash(sample, "sha1"))
    print("SHA256:", compute_hash(sample, "sha256"))
else:
    print("[!] Please fix the 'sample' path above before running this cell.")


Sample: /Users/dissept/Downloads/ProcessMonitor/Procmon.exe
MD5:    c3e77b6959cc68baee9825c84dc41d9c
SHA1:   bc18a67ad4057dd36f896a4d411b8fc5b06e5b2f
SHA256: 3b7ea4318c3c1508701102cf966f650e04f28d29938f85d74ec0ec2528657b6e


### Avalanche Effect

If you change even **one byte** in the file (e.g. by editing and re-saving it), the hashes will change **completely**: on average, about half of the output bits flip.  
This behaviour is known as the **avalanche effect**. It means that similar files do not produce similar hashes, so a hash identifies one exact sequence of bytes.
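
The effect can be observed without touching the sample. The sketch below flips a single bit in a short, made-up byte string and counts how many bits of the SHA-256 digest differ. The helper `bit_difference` is an illustrative name.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"static analysis sample"
modified = bytearray(original)
modified[0] ^= 0x01          # flip the lowest bit of the first byte
modified = bytes(modified)

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(modified).digest()
changed = bit_difference(d1, d2)

print(f"{changed} of {len(d1) * 8} digest bits differ")
```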


## String Extraction

Binary files, including malware, often contain **human-readable text**.

Analysts review these strings to identify:

- Hardcoded paths  
- Registry keys  
- Network infrastructure (domains, URLs, IPs)  
- Encryption keys or markers  
- Persistence mechanisms  

We will scan the file for printable ASCII sequences of **at least 4 characters**, which usually correspond to readable strings embedded in the executable.


In [14]:
import re

def extract_strings(path: str, min_length: int = 4):
    """Extract printable ASCII strings from a binary file."""
    with open(path, "rb") as f:
        data = f.read()
    pattern = rb"[ -~]{" + str(min_length).encode() + rb",}"
    return re.findall(pattern, data)

if os.path.isfile(sample):
    strings = extract_strings(sample, min_length=4)
    print(f"Extracted {len(strings)} strings. Showing first 20:\n")
    for s in strings[:20]:
        print(s.decode(errors="ignore"))
else:
    print("[!] Please fix the 'sample' path above before running this cell.")


Extracted 26507 strings. Showing first 20:

!This program cannot be run in DOS mode.
V*0T
0RichU
.text
`.rdata
@.data
.rsrc
@.reloc
hpqQ
h`EN
h|nN
h\nN
hlnN
=UUU
h_rM
hDLN
h`GO
hDLN
h|GO
hDLN


For **benign tools** like Procmon, strings will typically include:

- DLL names  
- Menu labels  
- File paths  
- Standard Windows messages  

In actual malware, this step often reveals:

- Command-and-control (C2) domains  
- Suspicious temp file names  
- Embedded scripts or PowerShell commands  
- Obfuscation artefacts
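
The ASCII scan above misses strings stored as UTF-16LE, which Windows binaries commonly use for wide-character text. A minimal sketch of a complementary scan follows, assuming the same printable range and minimum length. The function name and the test buffer are illustrative.

```python
import re
from typing import List

def extract_utf16le_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Extract printable ASCII characters encoded as UTF-16LE (char + 0x00)."""
    pattern = rb"(?:[ -~]\x00){" + str(min_length).encode() + rb",}"
    return [m.decode("utf-16-le") for m in re.findall(pattern, data)]

# Synthetic buffer: binary noise, one wide string, more noise, a short ASCII run
blob = b"\x01\x02" + "C:\\Windows\\Temp".encode("utf-16-le") + b"\xff\xfe" + b"abc"
wide_strings = extract_utf16le_strings(blob)
print(wide_strings)
```

Running both scans and merging the results gives a fuller picture than either one alone.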


## PE Header Inspection with `pefile`

Windows executables and DLLs, including much Windows malware, use the **Portable Executable (PE)** format.

By inspecting PE headers, we can learn:

- How the program is structured  
- Which libraries (DLLs) it relies on  
- Whether the file looks **packed or obfuscated**  
- Potential capabilities via **imported APIs** (e.g., networking, process injection)

The following script loads the PE file and reads:

- **Entry Point** – relative virtual address (RVA) where execution begins  
- **Image Base** – preferred loading address in memory  
- **Import Table** – external functions the binary relies on
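
One common heuristic for the "packed or obfuscated" question is byte entropy: compressed or encrypted data approaches 8 bits per byte, while ordinary code and text score lower. The sketch below is a plain-Python Shannon entropy function applied to synthetic buffers. As a rule of thumb rather than a fixed threshold, sections scoring above roughly 7 deserve a closer look. `pefile` section objects also expose a `get_entropy()` method that should give comparable per-section values.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(data).values())

low = shannon_entropy(b"\x00" * 1024)            # a single repeated byte value
high = shannon_entropy(bytes(range(256)) * 4)    # every byte value equally often

print(round(low, 3), round(high, 3))
```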


In [15]:
import pefile

if os.path.isfile(sample):
    pe = pefile.PE(sample)
    print("Entry Point:", hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint))
    print("Image Base:", hex(pe.OPTIONAL_HEADER.ImageBase))
    print("\nImported DLLs and first few functions:")
    print("=======================================")
    try:
        for entry in pe.DIRECTORY_ENTRY_IMPORT:
            print("DLL:", entry.dll.decode(errors="ignore"))
            for imp in entry.imports[:5]:
                # Imports by ordinal have no name; show the ordinal instead
                name = imp.name.decode(errors="ignore") if imp.name else f"ordinal {imp.ordinal}"
                print("   -", name)
            print()
    except AttributeError:
        print("[!] No import table found (file may be packed or malformed).")
else:
    print("[!] Please fix the 'sample' path above before running this cell.")


Entry Point: 0xa7f70
Image Base: 0x400000

Imported DLLs and first few functions:
DLL: WS2_32.dll
   - getsockname
   - listen
   - recv
   - closesocket
   - socket

DLL: VERSION.dll
   - GetFileVersionInfoW
   - VerQueryValueW
   - GetFileVersionInfoSizeW

DLL: COMCTL32.dll
   - ImageList_ReplaceIcon
   - ImageList_SetBkColor
   - ImageList_AddMasked
   - ImageList_BeginDrag
   - ImageList_EndDrag

DLL: FLTLIB.DLL
   - FilterSendMessage
   - FilterGetMessage
   - FilterReplyMessage
   - FilterConnectCommunicationPort

DLL: KERNEL32.dll
   - AcquireSRWLockExclusive
   - AcquireSRWLockShared
   - InitializeSRWLock
   - GetSystemInfo
   - VerSetConditionMask

DLL: USER32.dll
   - LoadStringA
   - DrawEdge
   - GetMessageW
   - TranslateMessage
   - DispatchMessageW

DLL: GDI32.dll
   - SaveDC
   - RestoreDC
   - SetBrushOrgEx
   - SetPixel
   - PatBlt

DLL: COMDLG32.dll
   - ChooseColorW
   - GetOpenFileNameW
   - PrintDlgW
   - ChooseFontW
   - FindTextW

DLL: ADVAPI32.dll
   - RegQuer

DLLs commonly imported by benign software include:

- `kernel32.dll` – core OS functions (files, processes, memory, etc.)  
- `user32.dll` – user interface and GUI functions  
- `advapi32.dll` – advanced Windows API, registry, services  

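If this were **malware**, imports worth a closer look could include:

- `CreateRemoteThread` (process injection)  
- `VirtualAllocEx` (remote memory allocation for shellcode)  
- `GetProcAddress` and `LoadLibraryA` (dynamic API resolution)  
- `WinExec` or `ShellExecuteA` (spawning child processes)

None of these is proof of malice on its own. `GetProcAddress` and `LoadLibraryA` in particular appear in many benign programs, so imports should be weighed in combination and in context.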
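
A simple way to apply this list is to compare the imported names against a lookup table. The sketch below assumes the import names have already been collected into a list of strings, for example from `pe.DIRECTORY_ENTRY_IMPORT`. The table and the `flag_imports` helper are illustrative, not an exhaustive or authoritative list.

```python
# Illustrative lookup table built from the API names discussed above
SUSPICIOUS_APIS = {
    "CreateRemoteThread": "process injection",
    "VirtualAllocEx": "remote memory allocation",
    "GetProcAddress": "dynamic API resolution",
    "LoadLibraryA": "dynamic API resolution",
    "WinExec": "spawning child processes",
    "ShellExecuteA": "spawning child processes",
}

def flag_imports(import_names):
    """Return the subset of imports that appear in the lookup table."""
    return {n: SUSPICIOUS_APIS[n] for n in import_names if n in SUSPICIOUS_APIS}

# Hypothetical import list mixing networking, GUI and flagged APIs
example_imports = ["socket", "recv", "GetProcAddress", "VirtualAllocEx", "DrawEdge"]
flagged = flag_imports(example_imports)
print(flagged)
```

A hit is a prompt for further review, not a verdict.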


## YARA Analysis

**YARA** is a powerful tool for:

- Writing **detection rules**  
- Identifying **malware families**  
- Matching file characteristics in **SOC pipelines**

We will write a very simple YARA rule that matches if the ASCII string `"http"` (case-sensitive) appears anywhere in the file.


In [16]:
import yara

rule_source = """
rule ContainsHTTP {
    strings:
        $s = "http"
    condition:
        $s
}
"""

if os.path.isfile(sample):
    rules = yara.compile(source=rule_source)
    matches = rules.match(sample)
    print("YARA matches:", matches)
else:
    print("[!] Please fix the 'sample' path above before running this cell.")


YARA matches: [ContainsHTTP]
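
To make clear what this rule does and does not catch, the sketch below reproduces its logic in plain Python as a case-sensitive byte search. By default a YARA text string is matched as case-sensitive ASCII, so upper-case or UTF-16 variants are missed unless the rule adds the `nocase` or `wide` modifiers. The buffers are synthetic examples.

```python
def contains_ascii(data: bytes, needle: bytes = b"http") -> bool:
    """Case-sensitive byte search, analogous to a plain YARA text string."""
    return needle in data

ascii_blob = b"\x00\x01see http://example.com\x00"
upper_blob = b"\x00HTTP/1.1 200 OK\x00"
wide_blob = "http://example.com".encode("utf-16-le")

print(contains_ascii(ascii_blob))   # lower-case ASCII occurrence
print(contains_ascii(upper_blob))   # upper-case only
print(contains_ascii(wide_blob))    # UTF-16LE encoded
```

Because almost any program that embeds a URL will match, a rule this broad is useful for learning the workflow rather than for detection.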


## Complete Static Triage Workflow

This section ties together all techniques seen so far into a single **static triage** script that:

1. Computes hashes  
2. Extracts readable strings  
3. Enumerates imports  
4. Identifies potential IOCs (URLs, IPs)  
5. Applies a simple YARA rule


In [25]:
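import hashlib
import os
import re

import pefile
import yara

def compute_hashes(path: str) -> dict:
    """Compute MD5, SHA1 and SHA256 in a single pass over the file."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}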

def extract_strings(path: str, min_length: int = 4):
    with open(path, "rb") as f:
        data = f.read()
    pattern = rb"[ -~]{" + str(min_length).encode() + rb",}"
    return re.findall(pattern, data)

if os.path.isfile(sample):
    print("=== HASHES ===")
    print(compute_hashes(sample))

    print("\n=== STRINGS (first 10) ===")
    strs = extract_strings(sample)
    for s in strs[:10]:
        print("-", s.decode(errors="ignore"))

    print("\n=== IMPORTS ===")
    pe = pefile.PE(sample)
    try:
        for entry in pe.DIRECTORY_ENTRY_IMPORT:
            print(entry.dll.decode(errors="ignore"))
    except AttributeError:
        print("[!] No import table found.")

    print("\n=== POTENTIAL IOCs ===")
    # Use a context manager so the file handle is closed promptly
    with open(sample, "rb") as f:
        decoded = f.read().decode(errors="ignore")
    urls = re.findall(r"https?://[^\s\"']+", decoded)
    ips = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", decoded)
    print("URLs:", urls)
    print("IPs:", ips)

    print("\n=== YARA RULE MATCHES ===")
    rule_source = r"""
    rule Simple {
        strings:
            $s = "http"
        condition:
            $s
    }
    """
    rule = yara.compile(source=rule_source)
    print(rule.match(sample))
else:
    print("[!] Please fix the 'sample' path above before running this cell.")


=== HASHES ===
{'md5': 'c3e77b6959cc68baee9825c84dc41d9c', 'sha1': 'bc18a67ad4057dd36f896a4d411b8fc5b06e5b2f', 'sha256': '3b7ea4318c3c1508701102cf966f650e04f28d29938f85d74ec0ec2528657b6e'}

=== STRINGS (first 10) ===
- !This program cannot be run in DOS mode.
- V*0T
- 0RichU
- .text
- `.rdata
- @.data
- .rsrc
- @.reloc
- hpqQ
- h`EN

=== IMPORTS ===
WS2_32.dll
VERSION.dll
COMCTL32.dll
FLTLIB.DLL
KERNEL32.dll
USER32.dll
GDI32.dll
COMDLG32.dll
ADVAPI32.dll
SHELL32.dll
ole32.dll
OLEAUT32.dll
SHLWAPI.dll
UxTheme.dll
dwmapi.dll
ntdll.dll

=== POTENTIAL IOCs ===
URLs: ['https://go.microsoft.com/fwlink/?LinkId=521839', 'https://go.microsoft.com/fwlink/?LinkId=521839', 'https://go.microsoft.com/fwlink/?LinkId=521839\\ul0\\cf0}}}}\\f0\\fs20', 'https://microsoft.com/exporting', 'https://microsoft.com/exporting}}}}\\f0\\fs19', 'http://www.microsoft.com/pkiops/crl/Microsoft%20Windows%20Third%20Party%20Component%20CA%202012.crl0\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04u0s0q\x06\x08+\x06\x01\x05\x05
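
The URL list above contains duplicates and trailing debris such as RTF markup, because the pattern accepts any non-whitespace character. The IP pattern likewise accepts impossible octets and fragments of version numbers. A tighter sketch follows, assuming it is acceptable to restrict URLs to characters allowed by URL syntax and to validate addresses with the standard `ipaddress` module. The function name and the test buffer are illustrative.

```python
import ipaddress
import re

# Only characters permitted in URLs; stops at '\', '{', '}', quotes and control bytes
URL_RE = re.compile(rb"https?://[A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=%-]+")
# Dotted quad that is not part of a longer dotted number such as 1.2.3.4.5
IP_RE = re.compile(rb"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])")

def extract_iocs(data: bytes):
    """Return (urls, ips) as sorted, de-duplicated lists of strings."""
    urls = sorted({m.decode("ascii") for m in URL_RE.findall(data)})
    ips = set()
    for match in IP_RE.findall(data):
        text = match.decode("ascii")
        try:
            ipaddress.IPv4Address(text)   # rejects octets above 255
        except ValueError:
            continue
        ips.add(text)
    return urls, sorted(ips)

blob = b"\x00junk http://example.com/path}}\\f0\x00 10.0.0.1 \x00 999.1.1.1 v1.2.3.4.5"
urls, ips = extract_iocs(blob)
print(urls, ips)
```

This still cannot separate a URL from bytes that happen to be valid URL characters, such as the `0` that follows `.crl` in the certificate data, so results need manual review.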

## Reflection

After running the full workflow, answer the following questions (e.g. in your lab report or notes):

1. What **IOCs** (hashes, URLs, strings, imports) did you discover?  
2. Which artefacts would be most useful for a **SOC** or **threat intelligence** team?  
3. How would this workflow change if you were analysing **suspected malware** instead of a trusted tool?  
4. What are the limitations of **static analysis** compared to dynamic analysis and full reverse engineering?
