Add in-repo threat model and point AGENTS.md/SECURITY.md at it #922

Open
potiuk wants to merge 1 commit into apache:master from potiuk:asf-security/threat-model-2026-06-06

Member

@potiuk potiuk commented Jun 6, 2026

This is a v0 draft proposal for the Nutch PMC to review — please correct, reject, or discuss as needed. Following up on Lewis's note to go ahead and draft the model so it keeps momentum.

Context. The ASF Security team is preparing the project for an automated agentic security scan we're piloting; the scan runs against a threat model so its output is signal rather than noise. Discoverability already landed in #920; this PR adds the model content.

What's in this PR:

  • THREAT_MODEL.md (new) — a v0 threat model written from your website security model + the codebase, following the threat-model-producer rubric. It is a strict superset of the website security model — nothing there is dropped; the sections the website page didn't cover (adversary model, enumerated properties, known non-findings, triage dispositions) are added and tagged (inferred) for you to confirm / correct / strike. Draft confidence ~10 documented / 22 inferred.
  • AGENTS.md + SECURITY.md — re-pointed so the chain resolves AGENTS.md → SECURITY.md → THREAT_MODEL.md, keeping the website references intact.

The framing to sanity-check: Nutch fetches + parses untrusted web content by design, so crawler "SSRF" (reaching internal/arbitrary URLs) and "parses hostile HTML/XML" are by-design — scoped by your URL filters, not by Nutch refusing. The in-model adversary is the malicious crawled-content supplier (XXE, parser DoS, decompression bombs, ReDoS in URL-filter regex). §11a captures those recurring false positives.

What we'd need from the PMC: walk §14 (3 waves) — a one-line confirm / correct / strike per question is enough; wave 1 (trusted-environment posture incl. nutch-server, crawler-SSRF/scope, the crawled-content adversary) shapes the rest. §14.7 asks whether this in-repo model or the website page should be canonical.

If you'd rather adjust the approach, comment on the PR or close it — entirely your call.

Adds a v0 THREAT_MODEL.md (a superset of the project's website security
model) and re-points the AGENTS.md -> SECURITY.md -> THREAT_MODEL.md
discoverability chain at it, keeping the website references. The threat
model is a provenance-tagged v0 draft for the PMC to review (see the open
questions in its section 14).

Generated-by: Claude Code (Claude Opus 4.8)

Member

lewismc commented Jun 6, 2026

Thank you again, @potiuk. The CI failure relates to markdownlint, which we can fix easily.
I will study this PR in detail tomorrow. Thank you.

Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks, @potiuk. Looks good.

See also the inline comments.

@lewismc, let me know who should polish the PR into its final version.

Comment thread THREAT_MODEL.md
2. **Crawler SSRF / scope.** Proposed: fetching internal/arbitrary URLs is
by-design and controlled by operator URL filters, not a Nutch vulnerability;
only *escaping the configured scope* is in-model. Correct? (→ §9, §11a)
3. **Primary adversary = the crawled-content supplier.** Proposed: the main
Contributor

Yes, agreed.

In addition, we must not trust the following, and must assume that an attacker can control them:

  • DNS, e.g., an attacker might point the DNS record for attackers-domain.xyz to an IP address not controlled by the attacker, including internal (local, loopback) IP addresses. IP address filters are available for the protocol-okhttp plugin. The filters need to be configured accordingly.
  • HTTP headers, including the "Location" header for redirects.
  • Every message in the protocol stack, including TCP/IP, TLS handshakes, etc., although Nutch relies on Java core or third-party libraries for these parts.
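
To illustrate the DNS point, here is a minimal sketch of an address check that rejects hosts resolving to internal IP addresses. The class `InternalAddressCheck` and its methods are hypothetical names chosen for this example; they are not the protocol-okhttp IP filter implementation, and the set of ranges treated as "internal" is an assumption an operator would tune.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of Nutch or protocol-okhttp.
class InternalAddressCheck {

    /**
     * True for loopback, wildcard, link-local, and IPv4 private (RFC 1918) addresses.
     * Note: isSiteLocalAddress() does not cover IPv6 unique local addresses (fc00::/7),
     * so a real filter would need an extra rule for those.
     */
    static boolean isInternal(InetAddress addr) {
        return addr.isLoopbackAddress()
            || addr.isAnyLocalAddress()
            || addr.isLinkLocalAddress()
            || addr.isSiteLocalAddress();
    }

    /**
     * Resolves the host and rejects it if ANY returned address is internal.
     * Unresolvable hosts are rejected as well (fail closed).
     */
    static boolean isSafeToFetch(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isInternal(addr)) {
                    return false;
                }
            }
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // IP literals are used so the demo does not depend on live DNS.
        String[] hosts = {"127.0.0.1", "10.0.0.5", "169.254.169.254", "93.184.216.34"};
        for (String host : hosts) {
            System.out.println(host + " -> " + (isSafeToFetch(host) ? "fetch" : "reject"));
        }
    }
}
```

A check like this is only meaningful if it is applied to the address the connection actually uses, and repeated for every redirect target, since both the DNS answer and the "Location" header are attacker-controlled.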

Member

I'd support augmenting the model with this exact narrative.

Member

lewismc commented Jun 6, 2026

The Yetus markdownlint issues can be relaxed and addressed elsewhere.

Member

lewismc commented Jun 6, 2026

@sebastian-nagel
Does §8.1 need attention here? The fetcher.parse and fetcher.store.content properties (although mitigated by default) can also result in corrupted segments in error scenarios. Maybe this is covered within the threat model language, but I thought I'd double check.

I also think that §8.1 "No execution of fetched content" needs more thought. I'm fairly sure CVEs have been registered against Tika for code execution from archive files. I may be wrong but thought I'd raise it as well. EDIT: I just noticed this is covered by the decompression/zip bombs stated in §9.

Here's my take on §14

Wave 1 — scope & adversary (these shape everything):
1. Trusted-environment posture / nutch-server. Proposed: the supported posture is "trusted environment only"; an exposed no-auth nutch-server is OUT-OF-MODEL: non-default-build. Correct? (→ §5a, §3, §11a)

Correct

2. Crawler SSRF / scope. Proposed: fetching internal/arbitrary URLs is by-design and controlled by operator URL filters, not a Nutch vulnerability; only escaping the configured scope is in-model. Correct? (→ §9, §11a)

Correct

3. Primary adversary = the crawled-content supplier. Proposed: the main in-model attacker is whoever controls fetched content (hostile HTML/XML/feeds/redirects), and parser robustness against it is the core property. Agree? Any other in-model adversary? (→ §7)

Agree. No further comments, for now.

Wave 2 — properties & parsers:

4. Parser hardening (XXE / bombs). Are XML/feed parsers configured against XXE and decompression/entity bombs by default, or is that operator config? (→ §8, §9)

By default, via http.content.limit and CVE fixes in the Tika parsers. What about parse-zip?
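
For context on the archive question: a limit on fetched bytes bounds the compressed input, while a decompression bomb does its damage in the expanded output. The sketch below shows one way to cap expanded size while reading a zip archive. `BoundedUnzip`, `expandedSizeWithin`, and `zipOfZeros` are hypothetical names for this illustration and say nothing about how parse-zip or Tika behave today.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not the parse-zip implementation.
class BoundedUnzip {

    /**
     * Streams through all entries and returns the total uncompressed size,
     * or -1 as soon as the running total exceeds maxBytes (or on an I/O error).
     */
    static long expandedSizeWithin(byte[] zipBytes, long maxBytes) {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            while (zin.getNextEntry() != null) {
                int n;
                while ((n = zin.read(buf)) != -1) {
                    total += n;
                    if (total > maxBytes) {
                        return -1; // stop early instead of expanding the whole archive
                    }
                }
            }
        } catch (IOException e) {
            return -1; // unreadable archive: reject
        }
        return total;
    }

    /** Builds a small test archive holding one entry of `size` zero bytes. */
    static byte[] zipOfZeros(int size) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry("zeros.bin"));
            zout.write(new byte[size]);
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] zip = zipOfZeros(1_000_000);
        System.out.println("compressed bytes: " + zip.length);
        System.out.println("within 64 KiB expanded? " + (expandedSizeWithin(zip, 65_536) >= 0));
        System.out.println("within 2 MB expanded? " + (expandedSizeWithin(zip, 2_000_000) >= 0));
    }
}
```

The archive in this demo is far smaller than 64 KiB on the wire yet expands to 1 MB, which is why the cap is enforced on bytes read out of the stream rather than on the input length.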

5. Resource line. Where is the line between an in-model parser-DoS on crafted content and an out-of-model "expensive legitimate crawl"? (→ §8)

This is operator dependent and would need to be assessed on a crawl-by-crawl basis. Instrumentation efforts may improve this understanding and we do have ErrorTracker to assist operators.

6. Supported plugin set. Which protocol-* / parse-* / indexer-* plugins are first-class for security vs. contrib/unsupported? (→ §2/§3/§5a)

The Nutch project has no concept of contrib. I believe the project considers all official plugins to fall into the first-class bucket. Any other plugins are outside the project's control.

Wave 3 — meta:

7. Canonicalization. This in-repo THREAT_MODEL.md is drafted as a superset of the website security-model page. Proposed: SECURITY.md points here for the full model, and the website page stays the operator-facing how-to. Agree, or should the website page remain canonical with this as a supplement? (→ meta)

Agree, we can do that easily.
