Add in-repo threat model and point AGENTS.md/SECURITY.md at it by potiuk · Pull Request #922 · apache/nutch

potiuk · 2026-06-06T01:32:22Z

This is a v0 draft proposal for the Nutch PMC to review — please correct, reject, or discuss as needed. Following up on Lewis's note to go ahead and draft the model so it keeps momentum.

Context. The ASF Security team is preparing the project for an automated agentic security scan we're piloting; the scan runs against a threat model so its output is signal rather than noise. Discoverability already landed in #920; this PR adds the model content.

What's in this PR:

THREAT_MODEL.md (new) — a v0 threat model written from your website security model + the codebase, following the threat-model-producer rubric. It is a strict superset of the website security model — nothing there is dropped; the sections the website page didn't cover (adversary model, enumerated properties, known non-findings, triage dispositions) are added and tagged (inferred) for you to confirm / correct / strike. Draft confidence ~10 documented / 22 inferred.
AGENTS.md + SECURITY.md — re-pointed so the chain resolves AGENTS.md → SECURITY.md → THREAT_MODEL.md, keeping the website references intact.

The framing to sanity-check: Nutch fetches + parses untrusted web content by design, so crawler "SSRF" (reaching internal/arbitrary URLs) and "parses hostile HTML/XML" are by-design — scoped by your URL filters, not by Nutch refusing. The in-model adversary is the malicious crawled-content supplier (XXE, parser DoS, decompression bombs, ReDoS in URL-filter regex). §11a captures those recurring false positives.
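One of the crawled-content attacks named above, the decompression bomb, can be sketched as follows. This is a minimal illustration under stated assumptions, not Nutch's actual code: the class `BoundedGunzip` and its methods are hypothetical names, and the size cap is only loosely analogous to a content limit enforced by the fetcher.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Nutch.
final class BoundedGunzip {

  /**
   * Decompresses gzip data but gives up as soon as the output would exceed
   * maxBytes. Returns null if the cap is hit or the input is malformed.
   */
  static byte[] gunzipBounded(byte[] compressed, int maxBytes) {
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        if (out.size() + n > maxBytes) {
          return null; // stop before buffering more than the cap
        }
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      return null; // treat malformed input like an over-limit document
    }
  }

  /** Test helper: gzip-compresses raw bytes in memory. */
  static byte[] gzip(byte[] raw) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bos.toByteArray();
  }
}
```

The point of the design is that the limit applies to the decompressed size while streaming, so a small hostile payload cannot force a large allocation before the check runs.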

What we'd need from the PMC: walk §14 (3 waves) — a one-line confirm / correct / strike per question is enough; wave 1 (trusted-environment posture incl. nutch-server, crawler-SSRF/scope, the crawled-content adversary) shapes the rest. §14.7 asks whether this in-repo model or the website page should be canonical.

If you'd rather adjust the approach, comment on the PR or close it — entirely your call.

Adds a v0 THREAT_MODEL.md (a superset of the project's website security model) and re-points the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md discoverability chain at it, keeping the website references. The threat model is a provenance-tagged v0 draft for the PMC to review (see the open questions in its section 14).

Generated-by: Claude Code (Claude Opus 4.8)

sonarqubecloud · 2026-06-06T01:36:45Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

lewismc · 2026-06-06T01:45:53Z

Thank you again, @potiuk. The CI failure relates to markdownlint, which we can fix easily.
I will study this PR in detail tomorrow. Thank you.

sebastian-nagel

Thanks, @potiuk. Looks good.

See also the inline comments.

@lewismc, let me know who should polish the PR into its final version.

sebastian-nagel · 2026-06-06T14:50:32Z

+2. **Crawler SSRF / scope.** Proposed: fetching internal/arbitrary URLs is
+   by-design and controlled by operator URL filters, not a Nutch vulnerability;
+   only *escaping the configured scope* is in-model. Correct? (→ §9, §11a)
+3. **Primary adversary = the crawled-content supplier.** Proposed: the main


Yes, agreed.

In addition, we must not trust the following, and must assume that an attacker can control them:

DNS, e.g., an attacker might point the DNS record for attackers-domain.xyz to an IP address not controlled by the attacker, including internal (local, loopback) IP addresses. IP address filters are available for the protocol-okhttp plugin. The filters need to be configured accordingly.

HTTP headers, including the "Location" header for redirects.

Every message in the protocol stack, including TCP/IP, TLS handshakes, etc., although Nutch relies on Java core or third-party libraries for these parts.

I'd support augmenting the model with this exact narrative.
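The DNS point can be illustrated with a small sketch: since the attacker chooses what a hostname resolves to, the check has to be made on the resolved address rather than on the hostname. This is a hedged example only. `InternalAddressCheck` and its methods are hypothetical names, and it does not reproduce the IP address filters of the protocol-okhttp plugin or their configuration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of Nutch or protocol-okhttp.
final class InternalAddressCheck {

  /** True if the address is loopback, private, link-local, wildcard, or multicast. */
  static boolean isInternal(InetAddress addr) {
    return addr.isLoopbackAddress()      // 127.0.0.0/8, ::1
        || addr.isSiteLocalAddress()     // 10/8, 172.16/12, 192.168/16 (IPv6: fec0::/10 only)
        || addr.isLinkLocalAddress()     // 169.254/16, fe80::/10
        || addr.isAnyLocalAddress()      // 0.0.0.0, ::
        || addr.isMulticastAddress();
    // Note: IPv6 unique-local addresses (fc00::/7) are NOT covered by these
    // JDK predicates and would need an explicit prefix check.
  }

  /**
   * Resolves the host and reports whether ANY resulting address is internal.
   * Fails closed: an unresolvable host is treated as internal.
   */
  static boolean resolvesToInternal(String host) {
    try {
      for (InetAddress addr : InetAddress.getAllByName(host)) {
        if (isInternal(addr)) {
          return true;
        }
      }
      return false;
    } catch (UnknownHostException e) {
      return true;
    }
  }
}
```

A check like this is only meaningful if it is applied to the address the connection actually uses, and repeated for every redirect target taken from a "Location" header. Resolving once for the check and again for the connection leaves a window in which the DNS answer can change.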

lewismc · 2026-06-06T21:22:03Z

The Yetus markdownlint issues can be relaxed and addressed elsewhere.

lewismc · 2026-06-06T22:01:27Z

@sebastian-nagel
Should we pay attention in §8.1 to the fetcher.parse and fetcher.store.content properties? Although mitigated by default, they can also result in corrupted segments in error scenarios. Maybe this is covered by the threat model language, but I thought I'd double-check.

I also think that §8.1 "No execution of fetched content" needs more thought. I'm fairly sure CVEs have been registered against Tika for code execution from archive files. I may be wrong, but thought I'd raise it as well. EDIT: I just noticed this is covered by the decompression/zip bombs stated in §9.

Here's my take on §14:

Wave 1 — scope & adversary (these shape everything):
1. Trusted-environment posture / nutch-server. Proposed: the supported posture is "trusted environment only"; an exposed no-auth nutch-server is OUT-OF-MODEL: non-default-build. Correct? (→ §5a, §3, §11a)

Correct

2. Crawler SSRF / scope. Proposed: fetching internal/arbitrary URLs is by-design and controlled by operator URL filters, not a Nutch vulnerability; only escaping the configured scope is in-model. Correct? (→ §9, §11a)

Correct

3. Primary adversary = the crawled-content supplier. Proposed: the main in-model attacker is whoever controls fetched content (hostile HTML/XML/feeds/redirects), and parser robustness against it is the core property. Agree? Any other in-model adversary? (→ §7)

Agree. No further comments, for now.

Wave 2 — properties & parsers:

4. Parser hardening (XXE / bombs). Are XML/feed parsers configured against XXE and decompression/entity bombs by default, or is that operator config? (→ §8, §9)

By default, via http.content.limit and CVE fixes in the Tika parsers. What about parse-zip?
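For reference, this is roughly what parser-level XXE hardening looks like with the JDK's built-in JAXP parser. It is a sketch under assumptions, not a description of how Nutch or Tika configure their parsers: `HardenedXml` is a hypothetical class, and refusing every DOCTYPE is the strictest option, which may reject legitimate feeds that declare one.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
final class HardenedXml {

  /** Builds a DOM parser that rejects DOCTYPE declarations outright. */
  static DocumentBuilder newBuilder() throws ParserConfigurationException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // Enable the JDK's built-in processing limits (entity expansion etc.).
    factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
    // No DOCTYPE means no internal or external entity definitions at all.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    factory.setXIncludeAware(false);
    factory.setExpandEntityReferences(false);
    return factory.newDocumentBuilder();
  }

  /** Returns true if the document parses under the hardened settings. */
  static boolean parses(String xml) {
    try {
      newBuilder().parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (Exception e) {
      // Includes the SAXParseException raised for a disallowed DOCTYPE.
      return false;
    }
  }
}
```

With these settings an entity-based payload fails at the DOCTYPE declaration, before any entity is resolved or expanded. Archive handling such as parse-zip is a separate concern that XML parser flags do not address.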

5. Resource line. Where is the line between an in-model parser-DoS on crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)

This is operator dependent and would need to be assessed on a crawl-by-crawl basis. Instrumentation efforts may improve this understanding and we do have ErrorTracker to assist operators.

6. Supported plugin set. Which protocol-* / parse-* / indexer-* plugins are first-class for security vs. contrib/unsupported? (→ §2/§3/§5a)

The Nutch project has no concept of contrib. I believe the project considers all official plugins to fall into the first-class bucket. Any other plugins are outside the project's control.

Wave 3 — meta:

7. Canonicalization. This in-repo THREAT_MODEL.md is drafted as a superset of the website security-model page. Proposed: SECURITY.md points here for the full model, and the website page stays the operator-facing how-to. Agree, or should the website page remain canonical with this as a supplement? (→ meta)

Agree, we can do that easily.

sebastian-nagel approved these changes Jun 6, 2026

potiuk wants to merge 1 commit into apache:master from potiuk:asf-security/threat-model-2026-06-06