          ██╗
         ██╔╝
    ╔══════════╗
    ║  ◉    ◉  ║
    ║    ╰╮    ║
    ╚══════════╝
        ╔╧╗
        ╚═╝
  ██╗   ██╗██╗  ██╗██████╗ ██╗     
  ██║   ██║╚██╗██╔╝██╔══██╗██║     
  ██║   ██║ ╚███╔╝ ██████╔╝██║     
  ╚██╗ ██╔╝ ██╔██╗ ██╔══██╗██║     
   ╚████╔╝ ██╔╝ ██╗██║  ██║███████╗
    ╚═══╝  ╚═╝  ╚═╝╚═╝  ╚═╝╚══════╝
  Security Nerd — VXRL's AI Pentest Assistant

Security Nerd

An AI-powered pentesting assistant for Kali Linux with web and CLI interfaces. It runs an agentic loop: describe a task, an LLM decides which security tools to run, executes them, analyses the output, and iterates until the task is done.

Built for solo pentesters and CTF players. Works best on Kali Linux where all the security tools (nmap, sqlmap, gobuster, hydra, etc.) are pre-installed. Runs locally with Ollama or connects to cloud LLM providers (Anthropic, OpenAI, Gemini, OpenRouter).

For authorized security testing and research only. This tool is intended for white hat penetration testing, CTF competitions, security research, and educational purposes. Only use it against systems you own or have explicit written permission to test.

No agent frameworks — no LangChain, no AutoGen, no MCP. Just a clean agentic loop.

Where It Fits — Pentest Team Hierarchy

┌─────────────────────────────────────────────────────────┐
│                    ENGAGEMENT LEAD                      │
│           Scoping, client comms, final report           │
└──────────────────────┬──────────────────────────────────┘
                       │
          ┌────────────┴────────────┐
          │                         │
┌─────────▼──────────┐   ┌──────────▼──────────┐
│  SENIOR PENTESTER  │   │  SENIOR PENTESTER   │
│  Strategy, complex │   │  Specialized area   │
│  exploitation      │   │  (web, AD, cloud)   │
└─────────┬──────────┘   └──────────┬──────────┘
          │                         │
     ┌────▼─────────────────────────▼────┐
     │       ★  SECURITY NERD  ★         │
     │                                   │
     │  • Runs recon & enumeration       │
     │  • Executes tools automatically   │
     │  • Writes custom scripts/exploits │
     │  • Tracks findings & evidence     │
     │  • Generates reports              │
     │  • Remembers across sessions      │
     │                                   │
     │  Like having a tireless junior    │
     │  who knows every tool on Kali     │
     └───────────────────────────────────┘

The assistant handles the repetitive parts of pentesting — tool execution, output analysis, and documentation — so you can focus on strategy and creative exploitation. It's not a replacement for a skilled pentester, but it can save hours of manual work. Think of it as a productivity tool that happens to know its way around Kali.

Quick Start

git clone https://github.com/alanh0vx/vx_security_nerd.git
cd vx_security_nerd
chmod +x setup.sh run.sh
./setup.sh
./run.sh

This opens the web UI at http://127.0.0.1:8080. Create a session, set your target scope, and start hacking.

First-Time LLM Setup

Click LLM Configuration in the sidebar (or use /config in CLI) to choose your provider:

| Provider | Setup |
| --- | --- |
| Ollama (local) | Install Ollama, pull a model (ollama pull qwen3-coder, or any larger one your GPU can handle), done |
| Anthropic | Set ANTHROPIC_API_KEY env var, select provider |
| OpenAI | Set OPENAI_API_KEY env var, select provider |
| Gemini | Set GEMINI_API_KEY env var, select provider |
| OpenRouter | Set OPENROUTER_API_KEY env var, select provider |

The web config modal auto-detects available Ollama models and lets you set API keys in-browser.

Try the Demo

./run.sh --demo

Runs a pre-configured black box assessment against testfire.net (IBM's intentionally vulnerable app). Click the phase button (P1–P9) to load a suggested prompt for each phase, or type your own tasks manually.

Prerequisites

| Dependency | Version | Notes |
| --- | --- | --- |
| Python | 3.11+ | python3 --version |
| Kali Linux | recommended | All pentest tools pre-installed; works on other Linux but you'll need to install tools manually |
| Ollama | latest (optional) | Only if using local LLM — ollama serve |
| scrot | any | Screenshot capture (apt install scrot) |
| ffmpeg | any | Screen recording (apt install ffmpeg) |

The setup script handles Python deps, venv, Playwright browser, and config automatically.

Usage

./run.sh                                  # Web UI (default) at http://127.0.0.1:8080
./run.sh --cli                            # CLI mode
./run.sh --cli --web                      # Both simultaneously
./run.sh --demo                           # testfire.net demo
./run.sh --help                           # All options

# Lab mode (HTB / OSCP / TryHackMe)
./run.sh --lab htb                        # HackTheBox — VPN, flags, machine tracking
./run.sh --lab oscp                       # OSCP — proof files, exam report format
./run.sh --lab thm                        # TryHackMe

# Projects (group sessions into engagements)
./run.sh --project htb-forest             # Open or create a named project
./run.sh --project htb-forest --lab htb   # Project + lab mode
./run.sh --project                        # Auto-named (nerd-proj-YYYYMMDD-01)
./run.sh --list-projects                  # List all projects
./run.sh --resume <session_id>            # Resume a session

Just Describe What You Want

>> Scan the target for open ports and services
>> Check if the web app login form is vulnerable to SQL injection
>> Find a way to escalate from www-data to root

The assistant picks the right tools, runs them, analyzes output, and continues.
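Under the hood this is a plain loop: call the LLM, extract tool tags, execute them, append the output, repeat. A minimal sketch (illustrative only; the tag format matches the <tool> examples later in this README, but the function names and message shapes are assumptions, not the project's internals):

```python
import re
import subprocess

TOOL_TAG = re.compile(r"<tool(?:\s+timeout=\"(\d+)\")?>(.*?)</tool>", re.DOTALL)

def run_turn(task, llm, max_depth=10):
    """Minimal agentic loop: ask the LLM, run any <tool> commands it
    emits, feed the output back, repeat until no tools are requested."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_depth):
        reply = llm(history)                       # one LLM call
        history.append({"role": "assistant", "content": reply})
        calls = TOOL_TAG.findall(reply)
        if not calls:
            return reply                           # task finished
        for timeout, cmd in calls:
            out = subprocess.run(cmd, shell=True, capture_output=True,
                                 text=True, timeout=int(timeout or 120))
            history.append({"role": "user",
                            "content": f"[tool output]\n{out.stdout}{out.stderr}"})
    return "max depth reached"
```

No framework needed: the conversation list is the only state, which is also why sessions survive restarts (see Troubleshooting).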

Commands

Web-UI-only surfaces (no slash-command equivalent): the toolbar's Status, Log, and Phases sidebar buttons. Everything below is also available in the CLI.

Common commands work with or without / prefix:

| Command | Description |
| --- | --- |
| help | Show all commands |
| phase / phase next / phase 5 | Workflow phase tracking (also prints a suggested prompt for the new phase) |
| scope / scope 10.10.11.45 | Show or set target scope |
| findings | Show discovered vulnerabilities |
| export | Generate Markdown + HTML report |
| config | Configure LLM provider, model, API keys |
| exit | Save session and quit |

Additional commands (use / prefix):

| Command | Description |
| --- | --- |
| /tools | List tools and install status |
| /provider [name] | List or switch LLM provider |
| /model <name> | Switch model |
| /think [on\|off] | Toggle thinking mode |
| /history [search <kw>] | Session browser / search |
| /screenshot [caption] | Take a screenshot |
| /record start\|stop | Screen recording |
| /lab <platform> | Activate lab mode |
| /machine | Machine info (lab mode) |
| /flags | Captured flags (lab mode) |
| /writeup | Generate HTB writeup |
| /vpn | VPN status |
| /pivot add\|map | Pivot route management |
| /memory [forget] | Cross-session memory |
| /project [list\|info\|findings] | Project management |

Workflow Phases

The assistant tracks 9 structured phases during an engagement, each with goals, tool checklists, exit criteria, and behavioral rules:

Phase 1: Reconnaissance          — Passive OSINT, no packets to target
Phase 2: Discovery & Mapping     — Host/port scanning, network topology
Phase 3: Enumeration             — Deep service-level enumeration
Phase 4: Vulnerability Analysis  — CVE research, attack planning
Phase 5: Exploitation            — Initial access, prove the vuln
Phase 6: Post-Exploitation       — Privesc, lateral movement, data access
Phase 7: Persistence             — Maintain access (red team only)
Phase 8: Cleanup                 — Remove artifacts, restore state
Phase 9: Reporting               — Generate reports, finalize findings

The assistant adapts per phase — in recon it won't run active scans, in exploitation it follows the attack plan from Phase 4. Each phase file in pentest_assistant/knowledge/workflow/ now has an explicit Scope Boundary section (what belongs here / what to defer to which other phase) and a strict Phase Behavior contract (DO / NEVER / ESCALATE), cross-mapped to PTES and NIST SP 800-115. This keeps the LLM from drifting — e.g., it won't try admin/admin probes during Discovery; that's a Phase 4 activity.

You don't have to follow the 9 phases. The workflow is a structured option, not a requirement. If you'd rather work freestyle — just describe tasks and let the assistant pick tools — set the phase to wherever your work sits (or leave it wherever it was) and type what you want: "scan the target", "try admin/admin on that login", "extract the database". The assistant will do it. The phase system is there when you want discipline and auditable structure; it's out of the way when you don't. The per-phase reports (✓ in the sidebar) still fill in based on what you did, whichever mode you used.

Phase dropdown (web UI): Click the phase button in the toolbar to pick any phase. Selecting a phase advances the session and prefills the chat input with a suggested prompt for that phase — you can edit it, add your own detail, or clear it and type something different before sending. Jumping ahead is safe: each template ends with a reminder that the assistant should gather any missing results from earlier phases first.

Phases sidebar + per-phase reports (web UI): A Phases section in the sidebar shows all 9 phases at a glance with status glyphs:

  • ✓ completed (had at least one turn, no longer current) — clickable
  • ● current (in progress) — clickable, opens the phase switcher
  • · pending (no activity yet) — inert/dimmed

Click a completed phase to open a per-phase report modal with:

  • Summary — manual narrative or LLM-generated, if you've ticked the phase (see below)
  • Stats — turns / elapsed / tokens / tool calls
  • Discoveries (auto-extracted from tool output) — open ports, subdomains, IPs, technology fingerprints, site tree. Surfaces target info even when the LLM didn't emit a formal <finding> tag
  • Findings — severity-tagged vulnerabilities created in that phase
  • Media — screenshots/recordings captured
  • Tool calls — the commands run

Toggle Show timeline for the full message transcript; Copy as Markdown copies the report to your clipboard. No LLM call is involved — assembled from session state, so it's free, instant, and always current.
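As an illustration of how discoveries can be auto-extracted with no LLM call, a small parser over nmap-style output might look like this (the regex and function name are assumptions, not the project's code):

```python
import re

PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)", re.MULTILINE)

def extract_open_ports(tool_output):
    """Pull (port, proto, service) triples out of nmap-style output,
    so target info surfaces even without a formal <finding> tag."""
    return [(int(p), proto, svc) for p, proto, svc in PORT_LINE.findall(tool_output)]

sample = """\
PORT     STATE  SERVICE
22/tcp   open   ssh
80/tcp   open   http
443/tcp  closed https
"""
# extract_open_ports(sample) -> [(22, 'tcp', 'ssh'), (80, 'tcp', 'http')]
```

Deterministic parsing like this is why the per-phase report is free and instant: it is assembled from session state, not regenerated by the model.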

Manual phase tick. Hover any phase row in the sidebar — a small ✓ button appears. Click to mark the phase complete even when no turns were auto-tracked (useful when the LLM did work that didn't get attributed to that phase, or when you did the work outside the assistant). The dialog gives you three options:

  • Tick — mark complete with no summary
  • Tick with custom summary — mark complete using whatever narrative you type in the dialog
  • Tick + LLM summary — mark complete and ask the LLM to generate a 400–600 word narrative from the phase's history (one LLM call, a few seconds)

Ticked phases get a ✓ glyph in the sidebar and their summary appears at the top of the per-phase report modal. Hover a completed phase and the ✓ becomes an × to untick. Stats from real activity (if any) are unaffected by ticking/unticking.

The per-phase modal is a snapshot, not the final deliverable — /export (or the Reporting phase) still produces the comprehensive client-style report.

Skipping phases is fine. Phase progression is advisory, not gated. Jumping from Phase 1 to Phase 5 doesn't block anything; the system prompt nudges the assistant to backfill prerequisites on its own. The final report (and the Phase 9 modal) explicitly notes which phases had no recorded activity, so the gap is visible to the reader rather than hidden.

CTF / lab mode: The 9-phase model is advisory in lab mode (HTB / OSCP / TryHackMe). Boxes commonly skip Persistence (7) and Cleanup (8), and may collapse Recon → Discovery → Enumeration into a single pass when the target is a single IP. Five things change automatically when a session has lab_meta set:

  • Persona swap. A separate knowledge/persona_ctf.md is loaded instead of the standard pentester persona — geeky CTF-player voice, flag-focused, rabbit-hole-aware, comfortable with loud scans (no defender), familiar with HTB / OSCP / THM platform quirks (user.txt / root.txt / local.txt / proof.txt flag locations, etc.).
  • Phase templates rewritten for CTF. Lab sessions get a parallel LAB_PHASE_TEMPLATES dict where every phase template is tuned for a single-target box. The {TARGET} placeholder is auto-substituted with lab_meta.target_ip. Phase 1 says "skip OSINT, you have an IP". Phase 2 hands you the exact nmap command for the target. Phase 5 reminds you to grab user.txt immediately after foothold. Phases 7–8 are explicitly skipped.
  • Skip Phase 1 by default. New lab sessions start at current_phase = 2 (Discovery) so you don't waste a turn skipping passive recon.
  • Phases sidebar default-collapsed. The 9-phase methodology is overkill for CTF, so the sidebar section starts collapsed (header still visible — click to expand if you want it). Manual toggles are remembered, so the auto-collapse only fires once.
  • 🚩 Kickoff button + auto-prefill. Each lab template ships with a platform-specific kickoff prompt that names the right flag format and locations — "capture both HTB{...} flags from ~/user.txt and /root/root.txt" for HTB, "capture local.txt (low-priv) and proof.txt (root/SYSTEM), screenshot each shell with ifconfig visible, no Metasploit" for OSCP, "work the room, flags follow THM{...} format" for THM, with {TARGET} / {MACHINE_NAME} / {DIFFICULTY} / {OS} substituted from lab_meta. A fresh lab session (empty history) auto-prefills it into the chat input on first connect; the toolbar 🚩 Kickoff button (visible only in lab mode) re-prefills on demand. Custom user-defined templates fall back to a generic kickoff prompt unless they ship their own kickoff_prompt field.

CLI: /phase <1-9> does the same as the toolbar button — it switches the phase and prints the suggested prompt as a copyable hint.

Features

Web UI

  • Real-time streaming chat with markdown rendering
  • Session management (create, switch, delete)
  • Scope management sidebar
  • Findings tracker with severity sorting and filtering
  • Phase navigation toolbar with prompt-prefilling dropdown (P1–P9)
  • Phases sidebar with per-phase status (✓ completed / ● current / · pending) and a click-to-open per-phase report modal (rendered markdown, copy as markdown, optional timeline)
  • LLM configuration modal (provider, model, API key)
  • Export reports (Markdown + HTML)
  • Live tool execution spinner with elapsed/budget counter for long scans (nikto, gobuster, etc.)
  • Refresh-safe turns: close the tab or refresh mid-scan, reconnect, and pick up exactly where you left off — history replay + in-progress spinner rebuild with correct elapsed time
  • Cancel button: stop a runaway turn instantly — kills the in-flight subprocess and releases the session lock (checks cancel mid-stream too, not just between tools)
  • Status button: one click asks the assistant for the mandatory phase-status block ("Status / Current phase / Next"). Complemented by a backend detector that flags turns ending without a status summary
  • Log button: per-session diagnostic log (turn_log.jsonl) with every LLM iteration, tool call, cancel, and safety-net trip in order — colour-coded table for post-mortem when something drifts
  • First-visit welcome panel: empty chat area renders a welcome + onboarding panel for new users (Create session / Try demo / Configure LLM), explaining the phase-driven vs freestyle workflows

Web Testing — Browser-Only (No curl)

Every HTTP request against a live web target goes through a Playwright tool, not curl. Patching curl's User-Agent fixes only one of the many fingerprint dimensions WAFs check — TLS stack, JA3 hash, header ordering, TCP options, and more all give curl away. The assistant uses headless Chromium for everything, including API calls:

  • playwright_open / playwright_check_headers / playwright_find_forms for browsing and page inspection
  • playwright_http for raw HTTP requests with arbitrary methods, custom headers, JSON or form bodies — the full curl replacement. Uses Chromium's APIRequestContext so TLS fingerprint is real and any cookies from a prior playwright_auth_flow are auto-shared.
  • playwright_eval for JavaScript in page context

The only times the assistant uses curl are downloading binaries to disk and SSRF exploitation (where you are the server being called). In practice this means dramatically fewer mysterious 403/406/429 dead ends during recon, and API testing that actually works against Cloudflare-fronted targets.

Dynamic Tool Timeouts

Tools default to a 120-second timeout, but the assistant can request a longer budget per invocation (<tool timeout="600">nikto ...</tool>) for scans it knows will take a while. Known-slow tools have higher static baselines (nikto/gobuster/sqlmap = 600s, hydra = 900s). The executor clamps requests to tool_timeout_max (default 1800s) and the UI shows a live elapsed/budget counter + "don't interrupt" hint so long scans don't look stuck.
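The budget selection amounts to something like this sketch (the baseline table mirrors the values above; the function name is mine, not the project's):

```python
BASELINES = {"nikto": 600, "gobuster": 600, "sqlmap": 600, "hydra": 900}

def effective_timeout(command, requested=None, default=120, hard_max=1800):
    """Pick a tool's timeout budget: a per-invocation request from the
    LLM wins, then the known-slow baseline, then the global default,
    all clamped to the hard ceiling (tool_timeout_max)."""
    tool = command.split()[0]
    budget = requested or BASELINES.get(tool, default)
    return min(budget, hard_max)
```

For example, effective_timeout("nikto -h target") yields 600, while a requested 3600s is clamped to 1800.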

Open-Source First — Kali's Built-in Arsenal, Then Write Your Own

Every phase leads with Kali-bundled OSS tools before anything cloud-hosted or paid. The assistant is told explicitly (in rules.md → "Third-Party APIs") to never substitute a placeholder like apikey=demo — if a paid service's key isn't in the environment, it uses the open-source alternative or skips the service. In practice that means:

  • Recon: crt.sh (CT logs, no key), Wayback Machine CDX API, sublist3r, theHarvester free sources, whois, dig, dnsenum, dnsrecon — not Shodan/Censys/Hunter/SecurityTrails (unless their keys are set).
  • Discovery/Enum: nmap, gobuster, ffuf, feroxbuster, nikto, whatweb, wafw00f, wpscan, nuclei, Playwright for anything HTTP.
  • Vuln Analysis: searchsploit (local ExploitDB), msfconsole search, nuclei -t http/cves/, zap-baseline, wapiti, sqlmap — not Nessus Pro, Qualys, or Burp Pro.
  • Exploitation: Metasploit Framework, ExploitDB, community PoCs from GitHub — not Cobalt Strike / Core Impact / commercial kits.
  • Post-exploitation: linpeas/winpeas, pspy, impacket-*, netexec (modern crackmapexec), evil-winrm, BloodHound Community — all free and Kali-bundled.

Custom Script Generation — the LLM's Biggest Advantage

When a stock tool doesn't quite fit, the assistant writes its own program. Code generation is something the LLM is genuinely fast and accurate at — and it's where it often beats a human operator:

  • Custom confirmation probes — test one hypothesis about one app, without a generic scanner's noise
  • Custom exploits — adapt public PoCs to the target's specific quirks, chain primitives (SSRF → internal API → deserialisation), fuzz a specific parameter format
  • Automation scripts — chain enumeration steps, parse output, correlate findings
  • Protocol tools — custom HTTP clients, binary protocol interaction, crypto ops
  • Post-exploitation — app-specific credential harvesters, targeted privesc checks, network-topology-aware pivot scripts

Scripts are saved to /tmp/probe_<name>.py so the assistant can iterate and debug. The policy is "prefer existing tool → prefer Kali OSS → write your own" — don't reinvent the wheel (if admin:admin works, don't write a brute forcer), but also don't stretch a generic scanner into something it was never built for.

>> The login form has a custom token. Write a script to brute force it.

  Writing custom brute-force script...
  <tool>shell cat << 'SCRIPT' > /tmp/brute_token.py
  #!/usr/bin/env python3
  import requests, sys, re
  target = sys.argv[1]
  # rockyou.txt contains non-UTF-8 lines; ignore decode errors
  for pwd in open('/usr/share/wordlists/rockyou.txt', errors='ignore'):
      pwd = pwd.strip()
      # Fetch a fresh CSRF token for each attempt
      s = requests.Session()
      r = s.get(f'http://{target}/login', timeout=10)
      m = re.search(r'name="token" value="(.+?)"', r.text)
      if not m:
          continue  # token not found: page changed or request failed
      r = s.post(f'http://{target}/login',
                 data={'user': 'admin', 'pass': pwd, 'token': m.group(1)},
                 timeout=10)
      if 'Dashboard' in r.text:
          print(f'[+] Found: admin:{pwd}')
          break
  SCRIPT</tool>
  <tool>shell python3 /tmp/brute_token.py 10.10.11.45</tool>

It won't write a kernel exploit from scratch, but for glue code, adapting public PoCs, and automating multi-step tasks — it's surprisingly handy.

Safety & Diagnostics

Agent loops misbehave. The system has three layers of defense plus a diagnostic log:

Runaway protection (config keys with defaults):

  • max_tools_per_response (15) — tool tags accepted from a single LLM response. Excess dropped; LLM gets a system note: "dropped N tool tag(s), slow down."
  • max_tools_per_turn (30) — total tools across all iterations of a turn. Hard backstop.
  • dup_burst_threshold (5) — the last N commands must not all normalise identically. Catches enumeration runaways like curl /v300, /v301, /v302, /v303, /v304 after the 5th call.
  • Cancel mid-stream — the Cancel button now interrupts the LLM response while it's streaming, not just between iterations. Previously a racing LLM could emit 30 tool tags and execute them all before the next cancel boundary.
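A sketch of how that duplicate-burst normalisation might work (masking digit runs is an assumption about the normalise rule; the real implementation may differ):

```python
import re
from collections import deque

def normalise(cmd):
    """Collapse digit runs so near-identical commands compare equal:
    'curl http://t/v300' and 'curl http://t/v301' both become
    'curl http://t/vN'."""
    return re.sub(r"\d+", "N", cmd.strip())

def dup_burst(recent_cmds, threshold=5):
    """True when the last `threshold` commands all normalise identically."""
    tail = list(recent_cmds)[-threshold:]
    return len(tail) == threshold and len({normalise(c) for c in tail}) == 1

history = deque(maxlen=5)
for i in range(300, 305):
    history.append(f"curl http://target/v{i}")
# dup_burst(history) -> True: the 5th call trips the guard
```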

Per-session turn log at ~/.pentest_assistant/sessions/<id>/turn_log.jsonl. Every event in chronological order:

| Event | Fields |
| --- | --- |
| turn_start | user_input, phase |
| llm_iter | depth, n_tool_calls, dropped_in_response, response_chars |
| tool_start / tool_end | command, exit_code, elapsed_s, timed_out, cancelled, scope_error |
| cancel_requested / cancel_applied | source, where (stream / boundary), had_active_subprocess |
| cap_hit | cap_kind, limit, dropped |
| dup_burst | threshold, pattern |
| missing_phase_status | phase, response_chars |
| turn_end | status, elapsed_s, total_tools, iterations |

Open with the Log button in the toolbar for a colour-coded table, or just jq the file from the shell.
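If you'd rather summarise the log in Python than jq, a small tally works, assuming each JSONL line carries a field naming its event type (the exact field name here is an assumption):

```python
import json

def summarise_turn_log(path):
    """Tally events in a turn_log.jsonl, handy when a turn misbehaved
    and you want counts before digging into individual entries."""
    counts = {}
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)["event"]
            counts[event] = counts.get(event, 0) + 1
    return counts
```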

Phase-status enforcement — a rule in rules.md (plus a deterministic detector in the backend) requires every turn to close with a 3-line status block: Status / Current phase (with exit-criteria progress) / Next. When the phase is complete, the assistant says "Phase N is complete" verbatim. If the LLM skips it, the backend logs missing_phase_status and the UI shows an inline nudge with an [Ask status] button.

Third-party APIs policy (in rules.md) — the assistant is hard-banned from using placeholder keys (no apikey=demo, no api_key=test). Paid services (Shodan, Censys, Hunter, SecurityTrails, ViewDNS, CertSpotter, VirusTotal) are only acceptable when the corresponding env var is actually set. If not set, the assistant uses the OSS alternative (crt.sh, Wayback CDX, sublist3r, theHarvester free sources) or skips the service. No more 401-storm on every recon turn.

Dual Frontend

CLI and Web UI share the same PentestAssistant backend — same agentic loop, same knowledge files, same providers. The Web UI adds a few diagnostic surfaces the CLI doesn't expose (per-phase report modals, the Log viewer, the Status button, the Phases sidebar, and the first-visit welcome panel); the CLI has everything needed for a full engagement and the same slash-command set.

Knowledge System

Deep methodology knowledge loaded contextually into the system prompt:

  • Engagement types: Black box, grey box, white box — each with different tool priorities and approaches
  • Playbooks: PTES, OWASP Top 10, Linux/Windows privesc, Active Directory, exploit research
  • Phase guides: Per-phase behavioral rules and checklists

Customizable Personality & Knowledge

The assistant's behavior is driven entirely by editable markdown files in knowledge/ — no code changes needed:

knowledge/
├── persona.md           # Personality, communication style, decision-making
├── rules.md             # Engagement rules (scope, ethics, safety)
├── methodology/         # Black/grey/white box approaches
│   ├── black_box.md
│   ├── grey_box.md
│   └── white_box.md
├── workflow/            # Per-phase guides (01_recon through 09_reporting)
└── playbooks/           # Deep methodology guides
    ├── owasp_web.md
    ├── privesc_linux.md
    └── ...

Safe to edit:

  • persona.md — change the tone, expertise level, verbosity, or focus areas. Want a more aggressive red teamer? A patient teacher for beginners? Just edit the persona. This won't break anything.
  • methodology/*.md — adjust the approach for each engagement type
  • playbooks/*.md — add new playbooks or modify existing ones (e.g., add cloud security, IoT, mobile)
  • workflow/*.md — customize phase checklists, exit criteria, tool priorities

Be careful editing:

  • rules.md — these are safety guardrails (scope enforcement, destructive command blocking, ethics). Relaxing these could lead to out-of-scope testing or data loss. Edit with understanding.

Won't break the system: The knowledge files are injected into the LLM's system prompt as context. Changing them changes the assistant's behavior, not the application logic. The worst case from a bad edit is the LLM giving poor advice — just revert the file.

Want to add a new playbook? Create a .md file in knowledge/playbooks/ and it can be loaded contextually. Fork the repo and customize it for your team's methodology.

Cross-Session Memory

The assistant remembers across sessions: target knowledge, credentials, user preferences, tool notes, patterns, and lessons learned.

Lab Mode (experimental)

When creating a new session (web UI or CLI), you can select a lab platform from the dropdown — HackTheBox, OSCP, or TryHackMe. This changes how the assistant behaves:

| Platform | Effect |
| --- | --- |
| HackTheBox | Watches for 32-char hex flags in tool output, tracks user/root flags, generates HTB writeups |
| OSCP | Tracks local.txt/proof.txt, screenshots with ifconfig visible, exam-style report format |
| TryHackMe | Room-based tracking, task/flag format |

Lab mode also enables VPN detection (tun0 interface), machine metadata tracking, pivot mapping, and loads both Linux and Windows privesc playbooks. No stealth needed — it goes fast and thorough.

You can switch lab platform anytime with /lab htb, /lab oscp, or /lab thm. These templates are best-effort starting points — they give the assistant useful context like flag formats and report styles, but they don't guarantee it will solve any particular box. The actual results depend on the LLM's capability, the target's complexity, and the tools available. Currently the three platform templates are built-in; adding custom templates would require editing lab_manager.py.
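VPN detection on Linux can be as simple as checking sysfs for the interface. A minimal sketch (the project's actual check may differ; note that tun devices often report operstate "unknown" even when up):

```python
from pathlib import Path

def vpn_interface_up(iface="tun0"):
    """Any interface that exists shows up under /sys/class/net on Linux;
    OpenVPN-style HTB/THM connections create tun0."""
    state = Path(f"/sys/class/net/{iface}/operstate")
    return state.exists() and state.read_text().strip() in ("up", "unknown")
```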

Projects

Group sessions into engagements. Come back tomorrow and pick up where you left off — scope, findings, and credentials persist.

Configuration

Config lives at ~/.pentest_assistant/config.json. Use the web UI config modal or CLI /config command to modify.

{
  "active_provider": "ollama",
  "providers": {
    "ollama": {
      "base_url": "http://localhost:11434",
      "model": "qwen3-coder-next:latest"
    },
    "anthropic": {
      "api_key_env": "ANTHROPIC_API_KEY",
      "model": "claude-sonnet-4-5"
    }
  },
  "max_tool_depth": 10,
  "tool_timeout_default": 120,
  "tool_timeout_max": 1800
}

API keys are never stored in the config file — only environment variable names. Set keys in your shell:

export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export OPENROUTER_API_KEY="sk-or-..."
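The api_key_env indirection presumably resolves at call time along these lines (the helper name and error message are mine, not the project's code):

```python
import os

def resolve_api_key(provider_cfg, environ=os.environ):
    """Config stores only the env var *name* (api_key_env), never the
    key itself; the secret is read from the environment at call time."""
    var = provider_cfg.get("api_key_env")
    if not var:
        return None          # local providers like Ollama need no key
    key = environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it in your shell")
    return key
```

This keeps config.json safe to commit or share: losing the file leaks no secrets.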

Supported LLM Providers

| Provider | Protocol | Thinking Support |
| --- | --- | --- |
| Ollama | Ollama API | Model-dependent |
| Anthropic | Messages API | Extended thinking |
| Gemini | Google AI API | thinkingConfig |
| OpenRouter | OpenAI-compatible | Model-dependent |
| OpenAI | Chat API | o3-mini reasons natively |
| Custom | OpenAI-compatible | Varies |

Data Storage

All data stored locally at ~/.pentest_assistant/:

~/.pentest_assistant/
├── config.json              # Configuration (no secrets)
├── history.db               # SQLite session index + FTS
├── memory.json              # Cross-session memory
├── sessions/                # Session data (600 permissions)
├── projects/                # Project data
└── reports/                 # Generated reports

Troubleshooting

Assistant seems stuck or stopped

Long LLM turns can look idle — a big model may spend a minute thinking, or a streaming response can stall between chunks. If nothing appears to be happening, just type something into the chat:

  • Ask for status: "what's your current status? what are you working on?"
  • Ask it to keep going: "continue" or "carry on with the current task"
  • Redirect it: send a new instruction and it will pick up from there

Typing a new message always wins — it either unblocks the current turn or starts the next one. There's also a 60-second stall watchdog that will surface a warning in the UI if a turn goes silent for too long, but you don't need to wait for it. When in doubt, just talk to it.

Running a legitimately slow tool? When the assistant runs something like nikto, gobuster, sqlmap, or hydra, the tool box shows a live elapsed / budget counter (e.g. 0:42 / 10:00) with a pulsing indicator and a "don't interrupt — this is expected" hint. While a long-running tool is in flight the stall watchdog is suppressed, so the UI won't tell you it looks stuck when a scan is just doing its job. Wait for the counter to finish, or press Cancel in the toolbar if you need to stop it.

Refreshed the browser, closed the tab — is my turn lost?

No. The agent runs in the background regardless of what the browser is doing, and the chat is designed to survive reconnects:

  • Refresh mid-turn — the page reloads, the WebSocket reconnects, and the new page immediately gets a history replay (all your past messages, tool outputs, and findings rebuilt in the chat area) plus a live resume event that rebuilds the in-progress streaming bubble and any active tool spinner — with the correct elapsed time, not reset to zero. You pick up exactly where you left off.
  • Close the tab and come back later — same mechanism. The turn kept running in the background and wrote all its results to disk. Reopen the session and you'll see the completed run in the chat history.
  • Kill the server process — this is the one failure mode: whatever hadn't been flushed to disk yet is lost. The last completed tool result and all prior findings are safe, but the LLM's current partially-streamed response isn't.

You can also cancel a running turn explicitly: click the red Cancel button in the toolbar (only visible while a turn is active). It sets a cancellation flag the agent checks between iterations, AND it sends SIGTERM to the process group of any in-flight subprocess (so a 10-minute nikto scan actually stops, it doesn't wait for the budget to expire). The chat shows "Turn cancelled by user" and the session unlocks immediately so you can send a new message.
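The process-group detail matters: a shell-spawned scanner would survive a plain kill of its parent shell. A minimal sketch of the pattern using only the standard library (Linux/POSIX):

```python
import os
import signal
import subprocess

def run_cancellable(cmd):
    """Start a tool in its own process group (start_new_session=True is
    a setsid), so cancellation can SIGTERM the whole group: shell, tool,
    and any children, not just the immediate child."""
    return subprocess.Popen(cmd, shell=True, start_new_session=True)

def cancel(proc):
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

proc = run_cancellable("sleep 600")
cancel(proc)
proc.wait()
# proc.returncode is -signal.SIGTERM: the scan stopped immediately
```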

A concurrency guard prevents sending a new message while a turn is already running — you'll see a "⚠ A turn is already running on this session. Wait for it to finish, or press Cancel to stop it." notice. This avoids race conditions where two turns would clobber each other's state.

Why this works: the assistant is stateless between turns — all the "memory" lives in the chat history. Every time you send a message, the entire conversation (your prompts, the assistant's reasoning, every tool command and its output, findings, scope, current phase) is re-sent to the LLM as context. So when you type "what's your status?" the model literally re-reads everything it just did and answers from that. When you type "continue" it re-reads the last tool result and decides the next step, exactly as if the loop had never paused. There is no hidden in-flight state to lose — the session file and chat history are the state. That's also why you can close the browser tab, come back tomorrow, resume the session, and the assistant picks up coherently: same mechanism, longer gap.
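That stateless design can be sketched as a context rebuild from disk on every turn (the file name and message shapes here are illustrative assumptions):

```python
import json
from pathlib import Path

def build_context(session_dir, system_prompt, new_message):
    """Rebuild the full LLM context from the persisted session on every
    turn: the session file *is* the state, so there is nothing in-flight
    to lose when the browser (or even the server) goes away."""
    history = json.loads(Path(session_dir, "history.json").read_text())
    return ([{"role": "system", "content": system_prompt}]
            + history
            + [{"role": "user", "content": new_message}])
```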

Tool timed out at 120s (nikto, gobuster, full port scans...)

Most tools default to a 120-second timeout, which is plenty for quick commands but not enough for deep scans. There are three ways to give a tool more room:

  1. Per-tool static defaults — nikto, gobuster, sqlmap now default to 600s, hydra to 900s, sublist3r to 300s. You don't need to configure anything for these; the assistant will use the longer budget automatically.
  2. Per-invocation override from the LLM — the assistant can request a longer budget for any specific command by adding a timeout="N" attribute to the tool tag: <tool timeout="600">masscan -p1-65535 10.10.11.45 --rate 1000</tool>. It's clamped to tool_timeout_max (default 1800s / 30 min). No config change needed.
  3. Raise the global defaults — if you regularly hit the ceiling, edit ~/.pentest_assistant/config.json:
    {
      "tool_timeout_default": 120,
      "tool_timeout_max": 3600
    }
    tool_timeout_default is the baseline for any tool without its own default; tool_timeout_max is the hard ceiling applied to every timeout="N" request.

While a long-running tool is in flight, the web UI shows a live elapsed / budget counter (0:42 / 10:00) with a pulsing indicator and a "don't interrupt" hint, and the stall watchdog is suppressed for the duration — so you won't get a false "looks stuck" placeholder while nikto is legitimately running.

Running in a VM (VirtualBox / VMware)

If you run Kali in a VM and the assistant stalls when the VM locks or goes to background:

What happens: VM hypervisors throttle CPU/GPU on background VMs. If using Ollama (local LLM), inference stalls. If using a cloud provider, the TCP connection dies server-side while the VM is paused — and on resume, the client is stuck waiting on a dead socket.

What the code does about it: All LLM providers now have a 120-second per-chunk read timeout. If no data arrives for 120s (VM paused, Ollama crashed, network died), the provider raises an error and the built-in retry logic (3 attempts, exponential backoff) restarts the LLM call automatically. The assistant self-heals without you having to type "continue."
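The shape of that self-healing loop, sketched with the requests library (the retry count and backoff mirror the description above; the actual provider classes differ):

```python
import time
import requests

def stream_with_retries(url: str, payload: dict,
                        read_timeout: float = 120.0,
                        attempts: int = 3) -> bytes:
    """Stream an LLM response; if no bytes arrive within read_timeout
    (VM paused, server died), requests raises instead of hanging on a
    dead socket, and we retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            with requests.post(url, json=payload, stream=True,
                               timeout=(10, read_timeout)) as resp:
                resp.raise_for_status()
                # iter_content enforces the read timeout per chunk,
                # not across the whole response
                return b"".join(resp.iter_content(chunk_size=8192))
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the loop
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
```

The key detail is the `(connect, read)` timeout tuple: the read half applies per chunk, so a healthy-but-slow stream is never killed, while a silent dead connection is detected within one read-timeout window.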

What you should do on the VM side:

# Disable screen lock (simplest fix)
gsettings set org.gnome.desktop.screensaver lock-enabled false

# Disable auto-suspend
gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-ac-type 'nothing'
gsettings set org.gnome.settings-daemon.plugins.power sleep-inactive-battery-type 'nothing'

# Disable screen blanking
xset s off && xset -dpms

In VirtualBox: keep the execution cap at 100% (Settings -> System -> Processor).

Ollama unreachable

# Check Ollama is running
curl http://localhost:11434

# If using a remote server, set the URL in /config or config.json
# Or switch to a cloud provider:
/config

Ollama read timeout on long conversations

If you see LLM call failed: Read timed out. (read timeout=120) repeating in the server log once a session has grown, you're hitting the prompt-eval wall. Pentest sessions routinely push input past 30k tokens, and prompt-eval on a large model can legitimately take a minute or more before the first token streams. The client's default read timeout is now 300s (previously 120s), and three knobs control this in ~/.pentest_assistant/config.json under providers.ollama:

{
  "providers": {
    "ollama": {
      "base_url": "http://192.168.50.124:11434",
      "model": "qwen3-coder-next:latest",
      "read_timeout": 300,
      "num_ctx": 65536,
      "num_predict": -1
    }
  }
}
  • read_timeout — per-chunk read timeout. 300s is the new default. Raise to 600 or 900 on slow hardware; lower only if you want faster dead-connection detection.
  • num_ctx — sent to Ollama on every request as an explicit option. 0 lets the server decide (uses OLLAMA_CONTEXT_LENGTH env or model default). Set to a concrete number (e.g. 65536) to lock the context window and avoid silent mid-conversation truncation.
  • num_predict — max tokens to generate per response. -1 is unlimited (recommended).
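Both knobs ride along in Ollama's per-request options field. A standalone way to see the payload shape sent to /api/chat (model name and URL are examples — substitute your own):

```python
import json
import urllib.request

# num_ctx / num_predict travel in the "options" field of every
# /api/chat request, overriding the server-side defaults per call.
payload = {
    "model": "qwen3-coder-next:latest",
    "messages": [{"role": "user", "content": "hi"}],
    "stream": False,
    "options": {"num_ctx": 65536, "num_predict": -1},
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # requires a running Ollama server
```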

When a turn's prompt passes 80% of num_ctx, the UI now shows an amber banner warning that the next turn may truncate or stall. The banner has a Compact & Continue button for one-click recovery (see below). Alternatively, /export and start a fresh session with the scope + a one-line summary.
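The 80% trigger is a simple ratio check. A rough back-of-the-envelope version, assuming ~4 characters per token for English-ish text (the backend's real accounting is tokeniser-aware):

```python
def context_pressure(prompt_chars: int, num_ctx: int,
                     chars_per_token: float = 4.0) -> float:
    """Fraction of the context window the next prompt will occupy,
    using a crude chars-per-token estimate."""
    return (prompt_chars / chars_per_token) / num_ctx

# ~230k chars of history against a 64k window is ~88% -> banner territory
context_pressure(230_000, 65_536)
```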

Compact & Continue — keeping long sessions alive

When context gets tight, you don't have to abandon the session. Click Compact & Continue in the pressure banner and the backend:

  1. Snapshots the session to ~/.pentest_assistant/sessions/<id>/pre_compact_<ts>.json (reversible — a bad summary is always recoverable).
  2. Calls the LLM once with a dedicated summariser prompt that preserves findings, credentials, CVEs, IPs, URLs, and payloads verbatim while dropping verbose tool output and analysis prose.
  3. Replaces all but the last 3 history messages with a single synthetic [Compacted summary of turns 1–N] entry.
  4. Reloads the chat — old bubbles collapse into a single click-to-expand "📋 Compacted summary" block.

Context typically drops from 100% → 10–20% in one call. Findings, scope, phase, and cross-session memory are untouched — those live outside history. The next turn sees the summary + the last 3 turns + whatever you type next.
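Structurally, compaction is a splice: everything except the tail is replaced by one synthetic message. A simplified sketch with the LLM summariser stubbed out (the real summariser prompt and message schema are the backend's, and the real flow also snapshots to disk first):

```python
def compact(history: list[dict], keep_last_n: int = 3,
            summarise=lambda msgs: "[Compacted summary of earlier turns]") -> list[dict]:
    """Replace all but the last keep_last_n messages with one summary
    entry. `summarise` stands in for the dedicated LLM summariser call."""
    if len(history) <= keep_last_n:
        return history  # nothing worth compacting
    head, tail = history[:-keep_last_n], history[-keep_last_n:]
    summary = {"role": "user", "content": summarise(head)}
    return [summary] + tail

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)
len(compacted)  # 4: one summary entry + the last 3 messages verbatim
```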

Config knobs (providers.ollama / root of config.json):

  • auto_compact_at_pct (default 0, off) — auto-compact before the next turn runs, once context crosses this percentage. Set to 95 once you trust compaction quality on your model.
  • compact_keep_last_n (default 3) — how many recent messages stay verbatim. 3–5 is the sweet spot.

Turn log events emitted: compact_started, compact_done (with before/after message+char counts + snapshot path), compact_failed (with reason — LLM error or summary too short), compact_restored (if the snapshot was rolled back).

Limitations: compaction is lossy by design. It's optimised to preserve the findings / credentials / attack chain, not verbatim chat history — if you need every tool output intact for a specific artefact, run /export first. The snapshot on disk keeps the original conversation around for forensic inspection, just not active in-context.

On the Ollama server side (recommended for a dedicated inference box):

sudo systemctl edit ollama
# add to the override file that opens:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_CONTEXT_LENGTH=65536"
Environment="OLLAMA_KEEP_ALIVE=2h"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl daemon-reload && sudo systemctl restart ollama
  • OLLAMA_CONTEXT_LENGTH=65536 — default num_ctx. Match your client num_ctx setting; 64k is plenty for pentest sessions.
  • OLLAMA_KEEP_ALIVE=2h — avoids the 51 GB cold-reload stall between turns.
  • OLLAMA_FLASH_ATTENTION=1 — 2–4× speedup on prompt eval for large contexts. Required for the q8_0 KV cache.
  • OLLAMA_NUM_PARALLEL=1 — single-user box; frees KV-cache memory for larger contexts.
  • OLLAMA_KV_CACHE_TYPE=q8_0 — halves KV-cache memory with negligible quality loss.

Pre-warm once after restart so your first request isn't paying the load cost:

curl -s http://localhost:11434/api/generate -d '{"model":"qwen3-coder-next:latest","prompt":"hi","stream":false,"keep_alive":"2h"}' > /dev/null

Missing tools

Run /tools to check install status. Install with sudo apt install <tool>.

Playwright issues

playwright install chromium
sudo playwright install-deps  # if needed

Limitations

Worth being honest about:

  • Not a replacement for a skilled pentester. The LLM makes mistakes, misreads output, and sometimes chases dead ends. Supervision required.
  • Quality depends on the LLM. Larger models give better results. There's a real quality/cost/speed tradeoff.
  • Human-in-the-loop by design. Each phase still runs through one turn at a time — you pick the phase, review the prefilled prompt, and hit Send. There is no unattended multi-phase "auto-pilot" mode; that was removed because it tended to freeze on long turns and desync the phase state. Use the P1–P9 dropdown to keep the engagement moving under your supervision.
  • Findings are LLM-generated, not signature-based. The assistant creates findings from its analysis of tool output — there's no CVE database or vulnerability signature matching. This means findings depend on the LLM's interpretation, and should be verified manually. It's a pentest assistant, not a vulnerability scanner.
  • Reports need human review. Generated reports capture evidence but need editing for client delivery.
  • Check your exam rules. Some certifications (e.g., OSCP) prohibit the use of AI/LLM tools during the exam. This assistant is great for practice and lab work, but make sure you understand what's allowed before using it in any exam setting.
  • It's v0.1.0. Rough edges exist. Feedback helps.

Contributing

This project is a work in progress. There's plenty of room for improvement, and contributions are genuinely appreciated.

  • Fork and PR — improvements to tools, providers, playbooks, UI, and docs are all welcome
  • Issues — bug reports and feature ideas via GitHub Issues
  • Playbooks — add new methodology guides in pentest_assistant/knowledge/playbooks/

See the knowledge/ directory for how methodology and playbooks are structured.

Disclaimer

This tool is provided for authorized security testing, research, and educational purposes only. You are solely responsible for ensuring you have proper authorization before testing any system. Unauthorized access to computer systems is illegal. The authors assume no liability for misuse of this software.

Always obtain written permission before conducting penetration tests. When in doubt, don't.

License

MIT
