Lab 15 — Automation

Requirements

Python 3.10+, Kali Linux, nmap, whois, dnsutils, curl.

sudo apt update && sudo apt install -y nmap whois dnsutils curl
pip3 install requests python-dotenv

Part 1 — `scanner.py`

Concurrent TCP connect scanner built on asyncio, with an asyncio.Semaphore capping the number of connections in flight.

How to run

# Scan localhost ports 1-1024 with default settings
python3 scanner.py 127.0.0.1

# Scan specific ports
python3 scanner.py 192.168.1.10 --ports 22,80,443,8080

# Scan a range with custom rate and save JSON output
python3 scanner.py 192.168.1.10 --ports 1-10000 --rate 500 --timeout 0.5 --output results.json
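
A minimal sketch of how the --ports values above could be parsed. parse_ports and build_parser are hypothetical names, and the --rate and --timeout defaults shown here are assumptions, not values taken from scanner.py.

```python
import argparse


def parse_ports(spec):
    """Expand a spec such as "22,80,8000-8002" into a sorted list of ports."""
    ports = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
        else:
            lo = hi = int(part)
        if not (1 <= lo <= hi <= 65535):
            raise ValueError(f"invalid port range: {part!r}")
        ports.update(range(lo, hi + 1))
    return sorted(ports)


def build_parser():
    p = argparse.ArgumentParser(description="Concurrent TCP port scanner")
    p.add_argument("target")
    # argparse turns a ValueError from parse_ports into a usage error.
    p.add_argument("--ports", type=parse_ports, default=parse_ports("1-1024"))
    p.add_argument("--rate", type=int, default=100)       # assumed default
    p.add_argument("--timeout", type=float, default=1.0)  # assumed default
    p.add_argument("--output")
    return p
```

Using a type= callable keeps validation in one place: a malformed range is rejected before any socket is opened.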

Design choices

asyncio over threading: a single event loop with cooperative scheduling is more efficient than OS threads at high concurrency (hundreds of simultaneous connections), because threads carry per-thread stack overhead.
Semaphore for rate limiting: asyncio.Semaphore(rate) caps the number of connections in flight at once (a concurrency limit, not a strict connections-per-second rate) without restructuring the code. This reduces the risk of resource exhaustion on the scanner and of tripping IDS or rate limits on the target.
argparse from the start: hardcoded targets make scripts single-use. Parameterising everything makes the tool reusable across engagements.
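
The asyncio and Semaphore choices above can be sketched as follows. This is an illustration under stated assumptions, not the code in scanner.py: probe and scan are hypothetical names, and the connect parameter exists only so the sketch can be exercised without touching the network.

```python
import asyncio


async def probe(host, port, sem, timeout, connect=asyncio.open_connection):
    """Try one TCP connect; return (port, True) if the handshake completed."""
    async with sem:  # at most `rate` probes hold the semaphore at once
        try:
            _reader, writer = await asyncio.wait_for(connect(host, port), timeout)
        except (OSError, asyncio.TimeoutError):
            # Refused, unreachable, timed out, or a local socket error.
            return port, False
        writer.close()
        try:
            await writer.wait_closed()
        except OSError:
            pass
        return port, True


async def scan(host, ports, rate=100, timeout=1.0, connect=asyncio.open_connection):
    """Probe all ports concurrently and return the sorted list of open ones."""
    sem = asyncio.Semaphore(rate)
    results = await asyncio.gather(
        *(probe(host, p, sem, timeout, connect) for p in ports)
    )
    return sorted(p for p, is_open in results if is_open)


if __name__ == "__main__":
    print(asyncio.run(scan("127.0.0.1", range(1, 1025))))
```

All probes are scheduled up front, but each one waits on the semaphore before opening a socket, so the number of open sockets never exceeds rate.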

Answer — why do false negatives appear at `--rate 2000`?

At very high concurrency, the scanner can hit its per-process file descriptor limit (ulimit -n, often 1024 by default on Linux) or run out of ephemeral source ports. When the OS cannot allocate a new socket, the connection attempt fails immediately with an OSError (for example EMFILE, "Too many open files") instead of timing out cleanly, and the scanner records that failure as "port closed". The port is actually open; the scanner simply could not reach it because of local resource exhaustion.

This matters operationally: "the scanner did not detect it" is never the same claim as "the port is closed." Any tool — including nmap — can produce false negatives if configured too aggressively, if a firewall silently drops packets, or if the target is rate-limiting incoming connections. Scan results are evidence of what was observable at the time of the scan, not proof of absence.
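
One way to keep local failures from being reported as "closed" is to classify the exception instead of collapsing every error into a single result. The sketch below is an assumption about how that could look; classify and the state names are hypothetical and not part of scanner.py.

```python
import asyncio
import errno

# errno values that point at the scanner's own host, not at the target.
LOCAL_ERRNOS = {errno.EMFILE, errno.ENFILE, errno.EADDRNOTAVAIL, errno.ENOBUFS}


def classify(exc):
    """Map the outcome of a connect attempt to a port state.

    `exc` is None when the connection succeeded, otherwise the exception raised.
    """
    if exc is None:
        return "open"
    if isinstance(exc, ConnectionRefusedError):
        return "closed"          # the target answered with RST
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        return "filtered"        # no answer: dropped, rate-limited, or slow
    if isinstance(exc, OSError) and exc.errno in LOCAL_ERRNOS:
        return "local-error"     # out of descriptors or source ports; retry later
    return "unknown"
```

Ports that end up as local-error can be queued for a second pass at a lower rate rather than written to the results as closed.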

Part 2 — `parse_scan.py`

Parses nmap XML output and enriches SSH hosts with their host key type.

How to run

# Generate the XML first
nmap -sV --open -oX scan.xml 192.168.1.0/24

# Parse and enrich
python3 parse_scan.py --input scan.xml --output hosts.json

Design choices

xml.etree.ElementTree (stdlib) instead of a third-party nmap library: no extra dependency, and reading the XML directly teaches the underlying structure.
Independent SSH enrichment: ssh-keyscan runs only for hosts with port 22 open. Each call is wrapped in try/except with timeout=6, so a non-responsive host doesn't block the entire script.
None for missing SSH key type: explicit null in JSON is unambiguous; it distinguishes "we checked and found nothing" from "we didn't check."
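
A minimal sketch of the parsing and enrichment described above, assuming the standard nmap XML layout (host, address, ports, port, state, service elements). The function names and the output field names are hypothetical; parse_scan.py may structure its results differently.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_hosts(xml_text):
    """Extract open ports and service details for each <host> in nmap XML."""
    hosts = []
    for host in ET.fromstring(xml_text).iter("host"):
        addr = host.find("address")  # first address element, usually the IP
        ports = []
        for port in host.iter("port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue
            svc = port.find("service")
            ports.append({
                "port": int(port.get("portid")),
                "service": svc.get("name") if svc is not None else None,
                "product": svc.get("product") if svc is not None else None,
                "version": svc.get("version") if svc is not None else None,
            })
        hosts.append({
            "ip": addr.get("addr") if addr is not None else None,
            "ports": ports,
        })
    return hosts


def parse_key_type(keyscan_output):
    """Return the key type from the first 'host keytype base64' line, else None."""
    for line in keyscan_output.splitlines():
        parts = line.split()
        if len(parts) >= 3 and not line.startswith("#"):
            return parts[1]
    return None


def ssh_key_type(host, timeout=6):
    """Run ssh-keyscan for one host; any failure yields None instead of raising."""
    try:
        proc = subprocess.run(["ssh-keyscan", host], capture_output=True,
                              text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return parse_key_type(proc.stdout)
```

Only hosts whose parsed ports include 22 would be passed to ssh_key_type, so a scan without SSH hosts never spawns a subprocess.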

Answer — why is a version banner like `Apache httpd 2.4.54` dangerous?

The version string is direct intelligence: it lets an attacker query CVE databases (NVD, Exploit-DB) for known vulnerabilities in that exact release without sending a single additional packet. A server returning Server: Apache forces the attacker to run version-fingerprinting probes, which are slower, noisier, and more likely to trigger IDS alerts. Hiding the version string does not fix underlying vulnerabilities, but it increases the attacker's cost and reduces the signal available to automated scanners.
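
As a rough illustration, a report could flag banners that disclose a version with a check like the one below. The regular expression is a heuristic assumption (any dotted number counts as a version), and discloses_version is a hypothetical helper, not part of parse_scan.py.

```python
import re

# Heuristic: a dotted number such as 2.4.54 or 1.18 is treated as a version.
VERSION = re.compile(r"\d+(?:\.\d+)+")


def discloses_version(banner):
    """Return True if a Server header or service banner contains a version."""
    return bool(VERSION.search(banner or ""))
```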

Part 3 — `auth_analysis.py` and `log_analysis.py`

How to generate test logs

# auth.log (SSH brute-force simulation)
python3 - <<'EOF'
import random, datetime
ips = ["10.0.0.1", "10.0.0.2", "185.220.101.5", "192.168.1.50", "45.33.32.156"]
users = ["root", "admin", "ubuntu", "daniel"]
now = datetime.datetime.now()
with open("auth.log", "w") as f:
    for _ in range(500):
        ip = random.choices(ips, weights=[2, 2, 40, 1, 30])[0]
        user = random.choice(users)
        ts = (now - datetime.timedelta(seconds=random.randint(0, 86400))).strftime("%b %d %H:%M:%S")
        f.write(f"{ts} kali sshd[1234]: Failed password for {user} from {ip} port {random.randint(40000,60000)} ssh2\n")
    for _ in range(20):
        ts = (now - datetime.timedelta(seconds=random.randint(0, 86400))).strftime("%b %d %H:%M:%S")
        f.write(f"{ts} kali sshd[1234]: Accepted publickey for daniel from 192.168.1.1 port {random.randint(40000,60000)} ssh2\n")
EOF

# access.log (web server with injected attack traffic)
python3 - <<'EOF'
import random, datetime
ips = ["10.0.0.1", "185.220.101.5", "45.33.32.156", "66.249.66.1", "192.168.1.50"]
normal_paths = ["/", "/index.html", "/about", "/contact", "/static/main.css"]
attack_paths = [
    "/?id=1' UNION SELECT 1,2,3--",
    "/admin/../../../etc/passwd",
    "/search?q=<script>alert(1)</script>",
    "/wp-admin/",
    "/cgi-bin/test.cgi?cmd=id",
]
now = datetime.datetime.now()
with open("access.log", "w") as f:
    for hour in range(24):
        count = random.randint(80, 120)
        if hour == 3:
            count = 950
        for _ in range(count):
            ip = random.choices(ips, weights=[30, 5, 5, 10, 3])[0]
            path = random.choices(normal_paths + attack_paths, weights=[20]*5 + [1]*5)[0]
            status = 200 if path in normal_paths else random.choice([200, 403, 500])
            ts = (now.replace(hour=hour, minute=random.randint(0,59))).strftime("%d/%b/%Y:%H:%M:%S +0000")
            f.write(f'{ip} - - [{ts}] "GET {path} HTTP/1.1" {status} {random.randint(200,5000)}\n')
EOF

How to run

python3 auth_analysis.py --log auth.log --threshold 10
python3 log_analysis.py --log access.log --report report.md

Design choices

defaultdict and Counter: accumulate counts without initialisation boilerplate.
Streaming line-by-line (for line in Path(...).open()): processes multi-gigabyte logs without loading them into memory.
3-sigma threshold computed from the actual data distribution, not a hardcoded number: the baseline adapts to the server's real traffic level.
Combined report.md written at the end of log_analysis.py so all findings are in one place.
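
The counting and thresholding choices above can be sketched as follows. This is a simplified illustration: the regular expression, the function names, and the reading of --threshold as "at least N failures" are assumptions, not the exact logic of auth_analysis.py or log_analysis.py.

```python
import re
import statistics
from collections import Counter

FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")


def failed_by_ip(lines, threshold=10):
    """Count failed SSH logins per source IP, streaming over any line iterable."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(2)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


def three_sigma_outliers(hourly):
    """Flag hours whose request count exceeds mean + 3 * population std dev."""
    values = list(hourly.values())
    limit = statistics.mean(values) + 3 * statistics.pstdev(values)
    return {hour: n for hour, n in hourly.items() if n > limit}, limit
```

Because failed_by_ip accepts any iterable of lines, an open file handle can be passed directly and the log is never held in memory as a whole.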

Answer — why does the 3-sigma rule fail on periodic traffic?

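Web traffic has strong daily periodicity: a server serving business users sees 2,000 req/hr at 14:00 and 50 req/hr at 04:00. A single global baseline mixes these together, which inflates the standard deviation and pushes the mean + 3σ threshold far above anything a quiet hour can reach. The threshold becomes too permissive everywhere and almost blind at night: a tenfold spike at 04:00 (50 to 500 req/hr) is still below the ordinary daytime level and never crosses the threshold, while a daytime attack has to exceed the peak by several standard deviations of the whole day's swing before it registers.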

A more robust approach segments the baseline by time bucket: compare 03:00 traffic only against other historical 03:00 readings (same hour, different days). This normalises the daily cycle before computing the deviation, producing a time-aware baseline that flags genuine spikes within each typical traffic stratum instead of comparing apples to oranges.
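
A minimal sketch of the per-hour baseline, assuming history maps each hour of day to the request counts observed at that hour on previous days. The function names and the data layout are hypothetical.

```python
import statistics


def global_limit(history):
    """mean + 3 sigma over all hours pooled together (the naive baseline)."""
    pooled = [n for counts in history.values() for n in counts]
    return statistics.mean(pooled) + 3 * statistics.stdev(pooled)


def hourly_outliers(history, today):
    """Compare each hour of `today` only against the same hour on earlier days."""
    flagged = {}
    for hour, count in today.items():
        past = history.get(hour, [])
        if len(past) < 2:  # a sample std dev needs at least two readings
            continue
        limit = statistics.mean(past) + 3 * statistics.stdev(past)
        if count > limit:
            flagged[hour] = (count, round(limit, 1))
    return flagged


if __name__ == "__main__":
    history = {
        4: [50, 55, 45, 52, 48],
        14: [2000, 1900, 2100, 2050, 1950],
    }
    today = {4: 500, 14: 2050}
    print("global limit:", round(global_limit(history)))
    print("flagged by hour:", hourly_outliers(history, today))
```

With this sample data the 04:00 reading of 500 is flagged by the per-hour comparison but sits far below the pooled limit, while the ordinary 14:00 reading is left alone.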

Part 4 — `recon.py`

Integrated multi-stage reconnaissance tool. Supports both domain and IP modes, writes structured JSON, a Markdown report, and a full audit log.

How to run

# Domain mode (auto-detected)
python3 recon.py scanme.nmap.org --verbose

# IP mode (explicit)
python3 recon.py 45.33.32.156 --mode ip --output ./my_recon/ --verbose

# Auto-detect IP
python3 recon.py 45.33.32.156 --verbose
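
Auto-detection of the mode could be as simple as the sketch below, which relies on the standard ipaddress module. detect_mode is a hypothetical name and recon.py may decide differently.

```python
import ipaddress


def detect_mode(target):
    """Return "ip" if target parses as an IPv4/IPv6 address, else "domain"."""
    try:
        ipaddress.ip_address(target)
    except ValueError:
        return "domain"
    return "ip"
```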

Output structure

recon_<target>_<timestamp>/
├── results.json   — all findings as structured data
├── report.md      — human-readable summary
└── audit.log      — timestamped record of every action

Design choices

Each step fails independently: every tool call goes through run(), which catches TimeoutExpired, FileNotFoundError, and general exceptions individually. A missing whois binary does not crash the DNS enumeration.
Structured data, not raw text: every parser extracts fields into a dict. results.json is machine-readable and can be fed into downstream scripts without regex post-processing.
Audit log is non-negotiable: penetration tests require proof of what the tool did and when. The log records every command, its return code, and the timestamp. It also captures tool unavailability (NOT_FOUND) so results can be interpreted correctly.
Missing security headers flagged in report: CSP, HSTS, X-Frame-Options, and X-Content-Type-Options are the minimum baseline. Their absence is a finding, not just neutral information.
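
The independent-failure and audit-log choices above might look roughly like this. It is a sketch under assumptions: the result dict layout, the logger name, and every status string except NOT_FOUND are hypothetical, and missing_security_headers is an illustrative helper.

```python
import logging
import subprocess

audit = logging.getLogger("recon.audit")  # assumed logger name

REQUIRED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Frame-Options",
    "X-Content-Type-Options",
)


def run(cmd, timeout=30):
    """Run one external tool without ever raising; log the outcome."""
    result = {"cmd": cmd, "status": "OK", "returncode": None, "stdout": ""}
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        result["returncode"] = proc.returncode
        result["stdout"] = proc.stdout
        if proc.returncode != 0:
            result["status"] = "FAILED"
    except subprocess.TimeoutExpired:
        result["status"] = "TIMEOUT"
    except FileNotFoundError:
        result["status"] = "NOT_FOUND"  # binary is not installed
    except Exception as exc:  # anything else is recorded, not propagated
        result["status"] = f"ERROR: {exc}"
    audit.info("cmd=%s status=%s rc=%s",
               " ".join(cmd), result["status"], result["returncode"])
    return result


def missing_security_headers(headers):
    """Return the baseline headers absent from a response (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]
```

To get timestamped entries in audit.log, the logger can be configured once at startup, for example with logging.basicConfig(filename="audit.log", level=logging.INFO, format="%(asctime)s %(message)s").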

Answer — active vs. passive reconnaissance (Shodan vs. your tool)

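Your tool (active recon) generates traffic that can be attributed to you. nmap SYNs and HTTP HEAD requests reach the target directly with your IP as the source, so they land in its firewall, IDS, and web server logs within seconds. DNS queries reach the target's authoritative name servers, normally through your recursive resolver, so the target sees the resolver's address unless you query the authoritative servers directly. whois lookups go to the registry or registrar rather than the target, but may still be logged there. A defender with IDS, firewall logs, or a SIEM will see your IP in the scan and HTTP traffic.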

Shodan (passive recon) has already scanned the internet; you query Shodan's database, not the target. Your IP never touches the target. No packet, no log entry. From a network-monitoring defender's perspective, passive recon is undetectable because there is nothing to detect: the scan that produced Shodan's data ran earlier (days, weeks, or months ago) from Shodan's own infrastructure.

When each is appropriate:

Use passive (Shodan) for pre-engagement intelligence gathering when stealth is required, for targets in scope that are particularly sensitive (ICS/SCADA), or when you need historical service data without triggering alarms.
Use active (your tool / nmap) when you need current, authoritative data — Shodan's records may be stale — or when the engagement explicitly authorises active scanning and you need confirmation of what is reachable right now.
In a real engagement, both are used: passive first to build a target map without alerting defenders, then active to confirm and fill gaps.
