Add in-repo threat model and point AGENTS.md/SECURITY.md at it #922
potiuk wants to merge 1 commit
Adds a v0 THREAT_MODEL.md (a superset of the project's website security model) and re-points the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md discoverability chain at it, keeping the website references. The threat model is a provenance-tagged v0 draft for the PMC to review (see the open questions in its section 14). Generated-by: Claude Code (Claude Opus 4.8)
Thank you again @potiuk, the CI failure relates to markdownlint, which we can fix easily.
2. **Crawler SSRF / scope.** Proposed: fetching internal/arbitrary URLs is
   by-design and controlled by operator URL filters, not a Nutch vulnerability;
   only *escaping the configured scope* is in-model. Correct? (→ §9, §11a)
3. **Primary adversary = the crawled-content supplier.** Proposed: the main
Yes, agreed.
In addition, we must not trust any of the following, and must assume that an attacker can control them:
- DNS, e.g., an attacker might point the DNS record for `attackers-domain.xyz` to an IP address not controlled by the attacker, including internal (local, loopback) IP addresses. IP address filters are available for the `protocol-okhttp` plugin. The filters need to be configured accordingly.
- HTTP headers, including the "Location" header for redirects.
- Every message in the protocol stack, including TCP/IP, TLS handshakes, etc., although Nutch relies on Java core or third-party libraries for these parts.
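
To make the DNS point concrete, here is a minimal sketch of the kind of check an IP address filter performs. It is not the `protocol-okhttp` implementation; the class and method names (`InternalAddressCheck`, `isInternal`, `resolvesToInternal`) are hypothetical, and only standard `java.net.InetAddress` calls are used.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of Nutch.
final class InternalAddressCheck {
    private InternalAddressCheck() {}

    /** True if the address is loopback, wildcard, link-local, site-local or multicast. */
    static boolean isInternal(InetAddress addr) {
        // Note: isSiteLocalAddress() covers 10/8, 172.16/12 and 192.168/16 for IPv4,
        // but not IPv6 unique local addresses (fc00::/7); a real filter needs more ranges.
        return addr.isLoopbackAddress()
            || addr.isAnyLocalAddress()
            || addr.isLinkLocalAddress()
            || addr.isSiteLocalAddress()
            || addr.isMulticastAddress();
    }

    /** True if ANY address the host name resolves to is internal. */
    static boolean resolvesToInternal(String host) {
        try {
            // An attacker-controlled DNS record may return several addresses,
            // so every one of them is checked, not just the first.
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isInternal(addr)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A check like this only helps if it is applied to the address the connection actually uses, and again to every redirect target taken from a "Location" header. Resolving once for the check and a second time for the connection would leave room for the DNS answer to change in between.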
I'd support augmenting the model with this exact narrative.
The Yetus markdownlint issues can be relaxed and addressed elsewhere.
@sebastian-nagel I also think that §8.1 "No execution of fetched content" needs more thought. I'm fairly sure CVEs have been registered against Tika for code execution from archive files. I may be wrong but thought I'd raise it as well.
EDIT: I just noticed this is covered by the decompression/zip bombs stated in §9.
Here's my take on §14:
Correct
Correct
Agree. No further comments, for now.
By default via `http.content.limit` and CVE fixes in Tika parsers. What about `parse-zip`?
This is operator dependent and would need to be assessed on a crawl-by-crawl basis. Instrumentation efforts may improve this understanding and we do have ErrorTracker to assist operators.
The Nutch project has no concept of
Agree, we can do that easily.



This is a v0 draft proposal for the Nutch PMC to review — please correct, reject, or discuss as needed. Following up on Lewis's note to go ahead and draft the model so it keeps momentum.
Context. The ASF Security team is preparing the project for an automated agentic security scan we're piloting; the scan runs against a threat model so its output is signal rather than noise. Discoverability already landed in #920; this PR adds the model content.
What's in this PR:
- `THREAT_MODEL.md` (new): a v0 threat model written from your website security model + the codebase, following the threat-model-producer rubric. It is a strict superset of the website security model; nothing there is dropped. The sections the website page didn't cover (adversary model, enumerated properties, known non-findings, triage dispositions) are added and tagged (inferred) for you to confirm / correct / strike. Draft confidence ~10 documented / 22 inferred.
- `AGENTS.md` + `SECURITY.md`: re-pointed so the chain resolves `AGENTS.md → SECURITY.md → THREAT_MODEL.md`, keeping the website references intact.

The framing to sanity-check: Nutch fetches + parses untrusted web content by design, so crawler "SSRF" (reaching internal/arbitrary URLs) and "parses hostile HTML/XML" are by-design, scoped by your URL filters, not by Nutch refusing. The in-model adversary is the malicious crawled-content supplier (XXE, parser DoS, decompression bombs, ReDoS in URL-filter regex). §11a captures those recurring false positives.
What we'd need from the PMC: walk §14 (3 waves). A one-line confirm / correct / strike per question is enough; wave 1 (trusted-environment posture incl. `nutch-server`, crawler-SSRF/scope, the crawled-content adversary) shapes the rest. §14.7 asks whether this in-repo model or the website page should be canonical.

If you'd rather adjust the approach, comment on the PR or close it. Entirely your call.