An end-to-end, self-hosted security platform that discovers external infrastructure, builds a live attack graph, correlates each external vulnerability back to the exact source code that deployed it, and generates verified, human-approved code fixes.
The differentiator is the Correlation Engine: where other tools stop at "patch this CVE on this host," SynoSec closes the loop to UserController.java:42 in commit a1b2c3d, with cryptographic evidence binding the running container to that code.
- What This Tool Does
- System Overview & Mental Model
- Architecture: The Nine Services
- Data Model (Graph + Relational + Vector)
- End-to-End Data Flow
- The Correlation Engine (Core Differentiator)
- The Agentic Orchestration Layer
- Technology Stack
- Inter-Service Contracts (APIs & Events)
- Security, Posture & Safety Controls
- Development Stages (Build Plan)
- Repository Layout
- Local Development & Deployment
- Testing & Acceptance Gates
- Glossary
SynoSec operates across two domains and bridges them:
External Domain (Infrastructure & Network Mapping)
- Autonomous discovery — an AI-guided scanner traverses authorized infrastructure, identifying systems, running services, software, and technologies.
- Dynamic graph mapping — as nodes are discovered, the system evaluates them for further connections, building a continuous live topology (nodes and edges).
- Vulnerability detection — identifies CVEs, misconfigurations, and weaknesses for each discovered component.
- Attack path chaining — analyzes how isolated vulnerabilities chain across connected nodes to reach deeper systems, producing an attack graph.
Internal Domain (Codebase & Remediation)
- Static code analysis (SAST) — scans the internal codebase for vulnerabilities.
- LLM-driven remediation — combines SAST findings with LLMs to generate code-level fixes.
- Human-in-the-loop (HITL) — surfaces fixes via a UI/PR workflow for approval or rejection.
The Bridge (the reason this tool exists)
- The Correlation Engine links an external attack path to the exact repository, commit, file, and function that produced the vulnerable running service — then triggers a verified patch against that code.
Posture: SynoSec performs continuous, non-exploitative validation. It confirms exploit feasibility (version match + reachability + optional sandboxed proof-of-concept against a digital twin) rather than firing live payloads against production.
Deployment constraint: Zero-egress, fully self-hosted. No prompts, code, scan data, or findings ever leave the customer perimeter. All LLM inference runs on-prem.
The simplest way to reason about SynoSec:
"Outside-in discovery feeds a graph. The graph plus code feeds a correlation. The correlation feeds a verified patch. A human approves the patch."
DISCOVER ──► GRAPH ──► CORRELATE ──► REMEDIATE ──► APPROVE
(external)   (Neo4j)   (Correlation Engine)   (LLM patch)   (HITL PR)
Everything is orchestrated by a LangGraph state machine running a ReAct reasoning agent plus a Verifier agent that gates every fact before it enters the graph. Tools (scanners, SAST engines, SBOM tools) are deterministic; the LLM only plans and explains — it never commits an unverified fact.
All services run as independently deployable containers, communicate over an internal gRPC/Protobuf mesh (mTLS), and emit events to a Kafka/Redpanda bus. The entire platform ships as a single Helm chart for air-gapped operation.
┌────────────────────────────────────────────────┐
│ Orchestration Plane │
│ ┌────────────────┐ ┌─────────────────────┐ │
│ │ LangGraph ReAct│ │ Verifier Agent │ │
│ │ Coordinator │◀▶│ (hallucination gate)│ │
│ └───────┬────────┘ └─────────────────────┘ │
└──────────┼─────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ ▼ │
┌──┴────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ ┌┴─────────────┐
│ Discovery │ │ Vulnerability │ │ SAST Pipeline │ │ LLM │
│ Engine │ │ Intelligence │ │ (Semgrep + CodeQL) │ │ Orchestrator │
│ │ │ │ │ │ │ + Patch Gen │
└───────┬───────┘ └──────────┬───────────┘ └──────────┬───────────┘ └──────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│ CORRELATION ENGINE │
│ Deterministic Joiner │ Fingerprint Matcher │ Confidence Scorer │ Attestation │
│ Verifier │
└────────────────────────────────┬───────────────────────────────────────────────┘
│
┌────────────────────────┼────────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ Neo4j │ │ Qdrant (RAG) │ │ PostgreSQL + │
│ (Topology + │ │ + Code Embeddings│ │ pgvector │
│ Attack Graph)│ │ │ │ (Findings, Audit) │
└───────────────┘ └──────────────────┘ └───────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Reporting + HITL UI (React) + PR Bot │
└────────────────────────────────────────────┘
Responsibility: Continuously enumerate the authorized external attack surface and internal assets.
Implements: subfinder → dnsx → naabu → httpx → katana pipeline; BloodHound CE for Active Directory; read-only cloud inventory collectors (CloudQuery/Steampipe).
Emits: AssetDiscovered events on discovery.events.
Constraint: Every tool invocation is checked against a signed Engagement scope object (CIDR + domain allowlist, rate ceilings, blackout windows, active/passive mode). Out-of-scope calls are refused at the orchestration layer.
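The allowlist part of the scope check can be sketched as follows. This is a minimal illustration, not the platform's implementation: `EngagementScope` and `in_scope` are hypothetical names, the scope object is assumed to have passed signature verification already, and rate ceilings, blackout windows, and mode checks are left out.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class EngagementScope:
    """Hypothetical scope object; assumed to be signature-verified upstream."""
    cidrs: tuple[str, ...] = ()
    domains: tuple[str, ...] = ()


def in_scope(target: str, scope: EngagementScope) -> bool:
    """Return True only if target is an allowlisted IP or (sub)domain."""
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        addr = None  # not an IP literal, so treat it as a hostname
    if addr is not None:
        return any(addr in ipaddress.ip_network(c) for c in scope.cidrs)
    host = target.lower().rstrip(".")
    # Match the domain itself or a true subdomain, never a bare suffix.
    return any(host == d or host.endswith("." + d) for d in scope.domains)


scope = EngagementScope(cidrs=("203.0.113.0/24",), domains=("example.com",))
print(in_scope("api.example.com", scope))   # in scope
print(in_scope("evilexample.com", scope))   # refused: suffix match is not enough
```

Refusing by default (anything that is neither an allowlisted address nor an allowlisted domain returns False) keeps the check fail-closed.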
Responsibility: Match discovered components to CVEs, KEV, EPSS, and misconfiguration templates.
Implements: Nuclei (templates), Grype (CVE matching from SBOM), Trivy (OS/IaC/secrets/K8s breadth). Maintains a local mirror of NVD, GHSA, OSV, EPSS daily snapshot, CISA KEV (zero-egress).
Key rule: Version-aware pre-filter — match CVEs against the actual deployed software version before writing to the graph, eliminating the bulk of version-irrelevant CVE noise.
Emits: VulnerabilityMatched events.
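A minimal sketch of the version-aware pre-filter, assuming plain dotted numeric versions and a half-open affected range (introduced ≤ deployed < fixed). The function names and the version numbers in the usage lines are hypothetical; real advisory data uses richer range formats and ecosystem-specific version ordering, which this sketch does not handle.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '2.4.0' into (2, 4, 0)."""
    return tuple(int(part) for part in text.split("."))


def is_affected(deployed: str, introduced: str, fixed: str) -> bool:
    """True if introduced <= deployed < fixed (half-open affected range)."""
    return parse_version(introduced) <= parse_version(deployed) < parse_version(fixed)


# Hypothetical advisory: affected from 2.0.0, fixed in 2.4.1.
print(is_affected("2.4.0", "2.0.0", "2.4.1"))  # deployed version is in range
print(is_affected("2.4.1", "2.0.0", "2.4.1"))  # fixed version is filtered out
```

Only matches that pass this check would be written to the graph as HAS_CVE edges; everything else is dropped before it becomes noise.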
Responsibility: Scan internal repositories at PR time and nightly.
Implements: Semgrep (fast PR feedback) + CodeQL (deep nightly taint analysis) + Joern (high-precision specialty pass). Reachability layer fuses CodeQL call graph + SVF (C/C++) + SootUp (Java) to drop dead-code findings.
Emits: CodeFinding{repo, commit, file, line, rule_id, sink_kind, taint_sources, reachability_status, confidence}.
Responsibility: Turn correlated findings into developer-quality patches and generate reports.
Implements: Self-hosted vLLM serving a coder model (e.g., Qwen2.5-Coder-32B) and a reasoner model (e.g., Llama 3.3 70B). RAG context from Qdrant.
Patch verification (pre-human): Every generated patch is re-run through Semgrep + CodeQL diff query + project unit tests in an ephemeral sandbox before any human sees it. Failed patches are dropped and logged as RAG-negative examples.
The Correlation Engine is the centerpiece; see Section 6.
Stores live topology, attack graph, and external↔internal correlations as a property graph. GDS library computes centrality and shortest paths.
RAG index for code chunks, framework docs, historical accepted patches, CVE descriptions. Embeddings served on-prem via Text-Embeddings-Inference (bge-large-en-v1.5 or nomic-embed-text).
Authoritative ACID system of record for findings, audit trails, HITL decisions, patch history. pgvector column used for semantic dedup ("is this the same finding as last week?").
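The semantic dedup check ("is this the same finding as last week?") amounts to a nearest-neighbour lookup over finding embeddings. The sketch below does it in plain Python for illustration only; in the platform this would be a pgvector query. The 0.92 threshold and the function names are assumptions, not values from this specification.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def find_duplicate(new_vec, known, threshold=0.92):
    """Return the id of the most similar known finding at or above threshold, else None."""
    best_id, best_sim = None, threshold
    for finding_id, vec in known.items():
        sim = cosine_similarity(new_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = finding_id, sim
    return best_id


# Toy 2-dimensional embeddings; real dedup vectors are 1024-dimensional.
known = {"f-1": [1.0, 0.0], "f-2": [0.0, 1.0]}
print(find_duplicate([0.99, 0.05], known))
```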
React/Next.js frontend over a GraphQL gateway. A GitHub/GitLab/Bitbucket bot opens PRs containing: the attack path, the correlation evidence, the SAST rule, the diff, and sandbox test results — with one-click approve/reject. Rejections flow back into the patch-history RAG index.
Node labels:
Asset, Service, Software, CVE, Misconfiguration, Repository, Commit, File, Function, Image, Pipeline, Identity, Finding, AttackPathStep.
Core relationships:
| Edge | From → To | Meaning |
|---|---|---|
| RUNS | Asset → Service | Asset hosts a service |
| LISTENS_ON | Service → Port | Service exposes a port |
| CONNECTS_TO | Asset → Asset | Network reachability |
| HAS_CVE | Software → CVE | Component is vulnerable |
| BUILT_FROM | Image → Commit | The key external↔internal edge |
| DEFINED_IN | Service → Repository | Service's source repo |
| DEPLOYED_BY | Asset → Pipeline | CI/CD that deployed it |
| CHAINS_TO | AttackPathStep → AttackPathStep | Vulnerability chaining |
| FIXED_BY | Finding → Patch | Remediation link |
Enrichment properties on attack-path nodes (used for prioritization): cvss, epss, kev (bool), node_centrality, reachability_from_entry, asset_criticality.
findings(id, tenant_id, type, severity, status, asset_ref, repo_ref,
         commit_sha, file_path, line, rule_id, confidence,
         correlation_evidence JSONB, created_at, dedup_vector vector(1024))
hitl_decisions(id, finding_id, decision, reason, actor, decided_at)
patch_history(id, finding_id, diff, verifier_result JSONB, accepted BOOL,
              feedback_text, created_at)
audit_ledger(id, prev_hash, payload JSONB, hash, created_at) -- append-only, hash-chained
engagements(id, tenant_id, scope JSONB, rate_limits JSONB,
            blackout_windows JSONB, mode, signature, signed_by)
Qdrant collections: code_embeddings, patch_history, cve_descriptions, framework_docs, each with payload filters (tenant_id, repo, language) for hybrid filter+vector search.
Engagement (signed scope)
│
▼
LangGraph Coordinator ── ReAct loop ──► Tool nodes (Nmap/Nuclei/Semgrep/...)
│ │ │
│ ▼ ▼
│ Verifier node ◀──────── Tool output
│ │
│ ▼ (only verified facts)
│ Neo4j upserts
│
▼
Correlation Engine fuses signals ──► Confidence-scored matches ──► PostgreSQL findings
│
▼
For each AUTO_LINK finding:
LLM Orchestrator ──► RAG retrieve (Qdrant) ──► Patch draft
│
▼
Patch Verifier (Semgrep+CodeQL+tests, sandbox)
│
▼
HITL PR Bot ──► developer accept/reject
│
▼
Outcome feedback ──► Qdrant patch_history (RAG)
Flow, narrated:
- Discovery → Graph. Discovery Engine emits AssetDiscovered. The Verifier confirms each fact (re-grab banner, re-resolve DNS) before the Coordinator upserts nodes into Neo4j. Unverified LLM-asserted edges are quarantined, never committed.
- Vuln Intel → Graph. CVEs and misconfigurations are matched (version-filtered) and written as HAS_CVE edges with cvss/epss/kev.
- Chaining → Attack Graph. k-shortest-path computation from declared entry points (internet, partner, insider) to crown-jewel assets, scored by CVSS × EPSS × (1 + reachability_bonus) / mitigation_score. Never enumerate all paths (path-explosion defense).
- Correlation. For each top-ranked vulnerable service, the Correlation Engine fuses signals to find the exact repo/commit/file (see §6).
- Remediation. For AUTO_LINK findings, the LLM Orchestrator generates a patch with RAG context; the Patch Verifier re-checks it in a sandbox.
- Approval. The PR bot surfaces evidence + diff; the human approves or rejects; the outcome feeds back into the RAG index (the only "learning" loop in a zero-egress deployment).
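The chaining step relies on extracting only the k cheapest paths rather than enumerating every path. A minimal sketch of that idea, using a best-first search over partial paths with the standard library only; the graph, node names, and edge costs are hypothetical, and the cost is assumed to be some non-negative "attacker effort" value derived from the path score (lower cost meaning a more attractive path). A production system would more likely use a graph library or Neo4j GDS.

```python
import heapq
from itertools import count


def k_cheapest_paths(graph, source, target, k):
    """Return up to k loop-free paths from source to target, cheapest first.

    graph maps node -> {neighbor: non-negative edge cost}.
    """
    tie = count()  # tie-breaker so the heap never has to compare path lists
    heap = [(0.0, next(tie), [source])]
    found = []
    while heap and len(found) < k:
        cost, _, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for nxt, edge_cost in graph.get(node, {}).items():
            if nxt not in path:  # keep paths simple (no node revisited)
                heapq.heappush(heap, (cost + edge_cost, next(tie), path + [nxt]))
    return found


# Hypothetical topology: two entry routes toward a crown-jewel database.
graph = {
    "internet": {"proxy": 1.0, "vpn": 4.0},
    "proxy": {"api": 1.0},
    "vpn": {"api": 1.0, "db": 1.0},
    "api": {"db": 1.0},
}
for cost, path in k_cheapest_paths(graph, "internet", "db", k=2):
    print(cost, " -> ".join(path))
```

Because edge costs are non-negative, a complete path popped from the heap is never more expensive than any path still waiting, so the search can stop as soon as k paths are found.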
The Correlation Engine answers one question with high accuracy and low false-positive rate:
"This vulnerable running service — which exact line of source code, in which commit, produced it?"
It is a confidence-scored, multi-signal join, not a single ML model. Signals are layered from deterministic (cryptographic) down to probabilistic (learned).
| Tier | Signal | Trust | How obtained |
|---|---|---|---|
| 1. Cryptographic | SLSA in-toto provenance verified via Sigstore/Cosign | 1.00 | cosign verify-attestation; slsa-verifier verify-image |
| 2. Declarative | OCI labels: org.opencontainers.image.source, .revision | 0.85 | docker buildx imagetools inspect --raw |
| 3. Declarative | SBOM (Syft) with package PURLs and source hints | 0.80 | syft <image> -o cyclonedx-json |
| 4. Pipeline | CI job ID / run URL stored as image label | 0.75 | Parsed from build-time labels |
| 5. Deployment | K8s annotations, ArgoCD app source, Helm values | 0.70 | K8s API + ArgoCD CRD watch |
| 6. IaC | Terraform state → module → resource → image tag | 0.65 | Terraform state parsing |
| 7. Network | Banner, JARM, HTTP tech fingerprint, favicon hash | 0.40–0.65 | httpx + Nuclei tech-detect |
| 8. Learned | Source structure ↔ runtime endpoint (TF-IDF candidates → BERT pairwise classifier) | 0.30–0.85 | Two-stage retrieval + classification |
| 9. AST/symbol | "Hot-spot" unique class/route names matched to deployed bytecode/JS | 0.50–0.80 | tree-sitter AST extraction, JS de-bundling, JVM class extraction |
Each signal s_i yields (weight_i, score_i ∈ [0,1]). Fuse with noisy-OR + contradiction penalty:
def fuse(signals):
    # signals: iterable of (weight, score) pairs, each value in [0, 1]
    p = 1.0
    for w, s in signals:
        p *= (1 - w * s)  # running probability that no signal indicates a match
    p_match = 1 - p
    return p_match * contradiction_penalty(signals)

# contradiction_penalty drops to 0.5 if two signals point at different
# repos beyond tolerance → emit a "provenance conflict" finding,
# NEVER silently pick one.
Decision thresholds:
| Band | Condition | Action |
|---|---|---|
| AUTO_LINK | adjusted ≥ 0.95 and ≥1 Tier-1 signal | Used directly in patch generation |
| PROPOSED_LINK | 0.70 ≤ adjusted < 0.95 | One-click human confirm |
| CANDIDATE | adjusted < 0.70 | Graph annotation only; never patched |
Hard rule: The LLM patch generator never operates on a correlation below AUTO_LINK without explicit HITL approval. This single gate prevents most "patched the wrong file" errors.
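The fusion rule and decision bands above can be put together in one self-contained sketch. Several details are assumptions made for illustration: the `Signal` dataclass and its fields are hypothetical, "different repos beyond tolerance" is simplified to "any two signals name different repos", and a score of 0.95 or more without a Tier-1 signal is treated as PROPOSED_LINK, a case the threshold table does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    tier: int      # 1 = cryptographic ... 9 = AST/symbol
    weight: float  # trust value from the signal table
    score: float   # per-signal match score in [0, 1]
    repo: str      # repository this signal points at


def fuse(signals: list[Signal]) -> tuple[float, bool]:
    """Noisy-OR fusion; halve the result when signals disagree on the repo."""
    p_no_match = 1.0
    for s in signals:
        p_no_match *= 1.0 - s.weight * s.score
    conflict = len({s.repo for s in signals}) > 1
    penalty = 0.5 if conflict else 1.0
    return (1.0 - p_no_match) * penalty, conflict


def band(signals: list[Signal]) -> str:
    """Map the adjusted confidence onto the three decision bands."""
    adjusted, conflict = fuse(signals)
    has_tier1 = any(s.tier == 1 for s in signals)
    if adjusted >= 0.95 and has_tier1 and not conflict:
        return "AUTO_LINK"
    if adjusted >= 0.70:
        return "PROPOSED_LINK"  # assumption: also covers >= 0.95 without Tier-1
    return "CANDIDATE"


agreeing = [
    Signal(tier=1, weight=1.00, score=1.0, repo="github.com/octo/spring-app"),
    Signal(tier=2, weight=0.85, score=1.0, repo="github.com/octo/spring-app"),
]
print(fuse(agreeing), band(agreeing))
```

Returning the conflict flag alongside the score lets the caller emit the provenance-conflict finding instead of silently picking one repository.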
- Discovery: api.example.com:443 runs Spring Boot 2.6.5. Nuclei flags CVE-2022-22965 (Spring4Shell, CVSS 9.8, EPSS 0.97, KEV).
- Chaining: Verifier confirms the reverse proxy actually proxies the vulnerable endpoint (non-exploitative HEAD/OPTIONS probe). Centrality + EPSS×CVSS re-rank this path to the top.
- Correlation:
  - K8s API → pod image digest sha256:abc….
  - cosign verify-attestation → SLSA provenance: source github.com/octo/spring-app, commit a1b2c3d. Tier-1, score 1.00.
  - docker buildx imagetools inspect → org.opencontainers.image.revision=a1b2c3d. Tier-2 corroborates.
  - Syft SBOM → spring-webmvc@5.3.17 (inside the affected 5.3.0 to 5.3.17 range). Tier-3 corroborates.
  - Fusion → 1.0 → AUTO_LINK.
- Code anchoring: Correlation Engine asks SAST: "in octo/spring-app@a1b2c3d, find public Spring controllers using the vulnerable DataBinder pattern." CodeQL returns src/main/java/com/octo/web/UserController.java:42; reachability confirms it's reachable from DispatcherServlet.
- Patch: Coder LLM gets the snippet + CodeQL alert + Spring advisory (RAG) + 3 historical accepted fixes. Produces a diff (@InitBinder with setDisallowedFields(...) or dependency upgrade). Verifier re-runs Semgrep+CodeQL+tests in sandbox; all pass.
- PR: Bot opens "Fix CVE-2022-22965 in UserController (reachable from internet via api.example.com)" with attack-path SVG, 4-signal correlation evidence (confidence 1.0), diff, test results, approve/reject.
Result: a perimeter Spring4Shell CVE resolved with a verified patch against an exact line, with cryptographic evidence binding the container to that code.
- Provenance conflict detector — disagreeing Tier-1/2 signals → surface as a supply-chain finding, don't auto-link.
- Sigstore Rekor cross-check — verify inclusion proof offline against a pinned trusted root; fail closed if missing.
- Signed-SBOM requirement — trust an SBOM only if attached as a signed in-toto attestation.
- Reachability gate — no patch unless the vulnerable symbol is reachable from a public entry point.
- Verifier agent on every LLM-asserted edge — re-validate with a deterministic tool call before committing to Neo4j.
Framework: LangGraph (state-machine + durable execution + checkpointing).
Design rules (security-critical):
- Planner/Executor split. The planner LLM has no tool access. The executor LLM has tools but is constrained to JSON I/O and an enforced allowlist. (Agent isolation is the single strongest prompt-injection defense.)
- JSON-structured tool I/O. All external content (banners, scraped HTML, CVE text) is wrapped in delimiters and never concatenated into the system-prompt zone.
- Bounded budget. Max tool calls and max wall-clock per goal (e.g., 50 calls / 20 min) prevents runaway agents.
- Tool allowlist at the graph level. The LLM cannot invoke an out-of-scope tool regardless of what it generates.
- Checkpoint every transition to Postgres so interrupted scans resume deterministically.
- No silent state-mutating tool calls. Any tool that opens a PR, writes a file, or changes a firewall rule requires HITL approval.
The Verifier agent (separate, smaller/cheaper model or deterministic rules) re-checks every claim before it becomes a graph fact:
- Numeric/factual claim → must invoke a tool to confirm (re-probe banner, re-resolve DNS, re-query inventory).
- Low-confidence edges → quarantine queue, never the live graph.
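Two of the design rules above, the tool allowlist and the bounded budget, can be enforced outside the LLM by a small gate that every tool call passes through. The sketch below is framework-agnostic and does not use any LangGraph API; `ToolGate` and `BudgetExceeded` are hypothetical names, and the defaults mirror the 50 calls / 20 min example.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a goal has used up its tool-call or wall-clock budget."""


class ToolGate:
    """Enforces a tool allowlist and a call/wall-clock budget per goal."""

    def __init__(self, allowlist, max_calls=50, max_seconds=20 * 60,
                 clock=time.monotonic):
        self.allowlist = frozenset(allowlist)
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.calls = 0

    def call(self, name, func, *args, **kwargs):
        # The allowlist is checked first: what the model asked for is irrelevant.
        if name not in self.allowlist:
            raise PermissionError(f"tool {name!r} is not allowlisted")
        if self.calls >= self.max_calls:
            raise BudgetExceeded("tool-call budget exhausted")
        if self.clock() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exhausted")
        self.calls += 1
        return func(*args, **kwargs)


gate = ToolGate(allowlist={"httpx"}, max_calls=2)
print(gate.call("httpx", lambda: "banner re-probed"))
```

Keeping the gate in ordinary code rather than in a prompt means the limits hold even if the model's output is manipulated by injected content.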
| Need | Choice | Notes |
|---|---|---|
| Port scanning | Naabu (+ Nmap second pass for OS fingerprint) | Default rate capped low (e.g., 200 pps) for safety |
| Subdomain enum | Subfinder (+ periodic Amass) | Passive-first |
| HTTP fingerprint | Httpx | Tech detection |
| Web crawling | Katana | Security-recon crawler |
| Template vuln scan | Nuclei (+ OWASP ZAP for authenticated DAST) | 9,000+ community templates |
| AD/identity | BloodHound CE (Neo4j-backed) | AD attack paths |
| Cloud inventory | CloudQuery + Steampipe (read-only IAM) | Normalizes to Postgres |
| Need | Choice | Notes |
|---|---|---|
| PR-time SAST | Semgrep | Fast; filter aggressively |
| Deep nightly SAST | CodeQL | Deepest open-source data-flow |
| High-precision pass | Joern | Low FPR specialty |
| Reachability | CodeQL + SVF (C/C++) + SootUp (Java) | Multi-tool fusion |
| FP filter (Stage 4) | LLM agent over SAST output | Reduces SAST false positives; the reduction is measured with the OWASP Benchmark harness |
| Need | Choice |
|---|---|
| SBOM generation | Syft (+ Trivy for breadth) |
| Vuln matching | Grype + Trivy (different DB coverage) |
| Image signing/attestation | Cosign + Sigstore (offline verify for air-gap) |
| SLSA verify | slsa-verifier |
| Continuous SBOM monitoring | Dependency-Track |
| Need | Choice | Notes |
|---|---|---|
| Orchestration | LangGraph | State machine + checkpointing |
| LLM serving | vLLM (prod), Ollama (dev/single-box) | Air-gap friendly |
| Reasoner LLM | Llama 3.3 70B or DeepSeek-V3 (quantized) | Open-weight |
| Coder LLM | Qwen2.5-Coder-32B or DeepSeek-Coder-V2 | Open-weight |
| Embeddings | bge-large-en-v1.5 / nomic-embed-text | Self-hosted via TEI |
| Vector DB | Qdrant | Single binary, payload filters |
| Graph DB | Neo4j Community (Memgraph/NebulaGraph swap >10M nodes) | GDS library |
| Findings DB | PostgreSQL + pgvector | ACID + dedup |
| Blob store | MinIO (S3-compatible) | Immutable scan output, SBOMs |
| Bus | Kafka / Redpanda | Event backbone |
| Metrics | Prometheus + Grafana | |
MulVAL (logical attack graph, polynomial), CAULDRON/TVA (exploit-dependency model), MITRE CALDERA (ATT&CK emulation plug-in), PentAGI / CAI / Strix (open-source agent references — none of which correlate to source code, confirming the gap SynoSec fills).
All services expose gRPC (Protobuf) and publish/consume Kafka events. Representative schemas:
Event: AssetDiscovered (topic discovery.events)
{
  "asset_id": "uuid", "tenant_id": "uuid",
  "ip": "203.0.113.10", "port": 443, "proto": "https",
  "service": "spring-boot", "version": "2.6.5",
  "banner": "...", "tls_jarm": "...", "fingerprint": {"...": "..."},
  "hashes": {"favicon": "..."}, "source_tool": "httpx",
  "timestamp": "2026-06-08T12:00:00Z", "confidence": 0.92
}
Event: VulnerabilityMatched (topic vuln.events)
{
  "asset_id": "uuid", "cve_id": "CVE-2022-22965",
  "cvss": 9.8, "epss": 0.97, "kev": true,
  "version_match": true, "source_tool": "nuclei"
}
Event: CodeFinding (topic sast.events)
{
  "repo": "github.com/octo/spring-app", "commit": "a1b2c3d",
  "file": "src/main/java/com/octo/web/UserController.java", "line": 42,
  "rule_id": "java/spring-disallowed-fields", "sink_kind": "data-binder",
  "taint_sources": ["request.param"], "reachability_status": "reachable",
  "confidence": 0.91
}
gRPC: CorrelationService.Correlate
rpc Correlate(CorrelateRequest) returns (CorrelateResponse);
message CorrelateRequest { string asset_id = 1; string image_digest = 2; }
message CorrelateResponse {
  string repo = 1; string commit = 2;
  float confidence = 3;
  Band band = 4;               // AUTO_LINK | PROPOSED_LINK | CANDIDATE
  repeated Signal signals = 5; // evidence trail
  bool provenance_conflict = 6;
}
gRPC: RemediationService.GeneratePatch → returns unified diff + justification + self-grade + verifier result.
- Signed Engagement object declares scope, per-tool rate ceilings, blackout windows, and active/passive mode.
- Two modes: PASSIVE (banner/DNS/cert-transparency/SBOM only) and ACTIVE_SAFE (port scan with adaptive backoff, Nuclei filtered to safe=true, no exploitation templates).
- Rate limiting: per-target token bucket; global circuit breaker halves rate if 5xx exceeds threshold.
- Validation, not exploitation: version match + reachability + optional sandboxed PoC against a customer-provided digital twin — never against production.
- Change-window/freeze flags honored from the customer CMDB.
- Zero-egress: all inference on-prem; nothing leaves the perimeter.
- Prompt-injection defense-in-depth: planner/executor isolation, JSON-structured I/O, an input prompt-injection classifier, an outbound egress allowlist on agent containers, and no silent state-mutating tool calls.
- Sandboxing: every tool runs in an ephemeral container — no host mounts, no cloud credentials, gVisor/Kata kernel isolation, egress restricted to engagement scope.
- Audit trail: every prompt, tool call, output, graph mutation, and HITL decision hash-chained into an append-only Postgres ledger.
- Data retention: per-tenant AES-GCM at rest (Vault/KMS). Defaults: scan data 90 days, findings indefinite, LLM I/O 30 days (configurable to ephemeral).
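The hash-chained audit trail mentioned above can be illustrated with an in-memory ledger. This is a sketch under stated assumptions: SHA-256 over the previous hash concatenated with a canonical JSON payload, and an all-zero genesis value, are choices made here for illustration; the specification only says the ledger is append-only and hash-chained. Field names follow the audit_ledger columns.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list[dict], payload: dict) -> dict:
    """Append a new entry that commits to the current tail of the ledger."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"prev_hash": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    ledger.append(entry)
    return entry


def verify(ledger: list[dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


ledger: list[dict] = []
append(ledger, {"kind": "tool_call", "tool": "httpx"})
append(ledger, {"kind": "hitl_decision", "decision": "approve"})
print(verify(ledger))
```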
| Tier | Scope | Behavior |
|---|---|---|
| 0 | Read-only PASSIVE scans, allowlisted assets | Auto-execute |
| 1 | ACTIVE_SAFE scans, SAST, draft patch generation | Auto-execute + audit |
| 2 | PR creation, sub-AUTO_LINK edge commits, sandboxed PoC | Approve-before-execute |
| 3 | Scope changes, tool-allowlist changes, model upgrades | Two-person approval |
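The tier table can be read as a lookup from action to the number of human approvals required. A minimal sketch, in which the action names are hypothetical labels for the table's entries and unknown actions are assumed to fall into the strictest tier:

```python
# Hypothetical action names mapped onto the tier table.
ACTION_TIER = {
    "passive_scan": 0,
    "active_safe_scan": 1,
    "sast_scan": 1,
    "draft_patch": 1,
    "open_pr": 2,
    "commit_sub_auto_link_edge": 2,
    "sandboxed_poc": 2,
    "change_scope": 3,
    "change_tool_allowlist": 3,
    "upgrade_model": 3,
}

# Tiers 0 and 1 auto-execute, tier 2 needs one approver, tier 3 needs two.
APPROVALS_BY_TIER = {0: 0, 1: 0, 2: 1, 3: 2}


def approvals_required(action: str) -> int:
    """Number of human approvals needed before the action may run."""
    tier = ACTION_TIER.get(action, 3)  # assumption: unknown actions get the strictest tier
    return APPROVALS_BY_TIER[tier]


def may_execute(action: str, approvers: set[str]) -> bool:
    """Distinct approvers must meet the tier's requirement."""
    return len(approvers) >= approvals_required(action)


print(may_execute("open_pr", {"alice"}))
print(may_execute("change_scope", {"alice"}))
```

Using a set of approver identities means the same person approving twice does not satisfy the two-person rule.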
- Discovery sharding by tenant namespace with Redis distributed token buckets.
- Path-explosion defense: k-shortest paths on the attack subgraph, never full enumeration; logical-attack-graph model (polynomial in network size).
- Prioritization scoring: CVSS × EPSS × (1 + reachability_bonus) × asset_criticality × centrality / mitigation_score. EPSS daily-refreshed.
- Graph scale: Neo4j Community to ~10M nodes; Memgraph/NebulaGraph migration path beyond.
- LLM cost control: cheap Verifier model for ~80% of checks; reserve the large reasoner for novel/low-confidence correlations; cache RAG retrievals.
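The prioritization formula above translates directly into a small function. The input ranges and sample values used here are assumptions for illustration (the specification gives the formula but not the scales of reachability_bonus, asset_criticality, centrality, or mitigation_score); the only guard added is against a non-positive divisor.

```python
def priority(cvss: float, epss: float, reachability_bonus: float,
             asset_criticality: float, centrality: float,
             mitigation_score: float) -> float:
    """CVSS × EPSS × (1 + reachability_bonus) × asset_criticality × centrality / mitigation_score."""
    if mitigation_score <= 0:
        raise ValueError("mitigation_score must be positive")
    return (cvss * epss * (1 + reachability_bonus)
            * asset_criticality * centrality / mitigation_score)


# Hypothetical findings: same CVE, but the second sits behind stronger mitigations.
exposed = priority(9.8, 0.97, 0.5, 1.0, 0.8, 1.0)
mitigated = priority(9.8, 0.97, 0.5, 1.0, 0.8, 4.0)
print(exposed > mitigated)
```

Because EPSS is refreshed daily, scores computed this way would need to be recomputed on the same cadence for the ranking to stay current.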
Build the deterministic path first. The learned matcher (Stage 4) can lag without breaking the product thesis: deterministic SLSA/OCI/SBOM joins already exceed competitor accuracy.
Goal: discover → graph, end to end, safely.
- Stand up Discovery Engine (Naabu/Subfinder/Httpx/Katana/Nuclei) inside a LangGraph orchestrator with strict scope enforcement.
- Stand up SBOM pipeline (Syft + Grype + Trivy) and Neo4j topology store.
- Define the property-graph schema (Asset, Service, Software, CVE, Repository, Commit, Image, Finding).
- Stand up Kafka bus, Postgres findings store, MinIO blob store.
- Gate: scan a synthetic 1,000-asset environment end-to-end; graph populated; zero production-impact incidents.
Goal: generate verified patches behind a HITL PR.
- Integrate Semgrep (PR-time) + CodeQL (nightly).
- Stand up vLLM + coder model; build RAG over Qdrant with repo embeddings.
- Build the HITL PR bot (GitHub/GitLab apps).
- Implement the Patch Verifier (sandboxed Semgrep+CodeQL+tests before PR open).
- Gate: ≥70% of generated patches pass the Verifier sandbox on a curated corpus; ≥40% accepted by developers in pilot.
Goal: ship the moat using cryptographic + declarative signals.
- Implement Tier-1 → Tier-6 signals: Sigstore/SLSA verify, OCI label parsing, SBOM-PURL joins, CI metadata, K8s/ArgoCD parsing, Terraform-state parsing.
- Implement the confidence-scoring engine (noisy-OR + contradiction penalty).
- Wire AUTO_LINK / PROPOSED_LINK / CANDIDATE gating into PR generation.
- Gate: ≥95% of AUTO_LINK matches correct on developer review; zero false patches merged.
Goal: cover customers lacking provenance; add chaining.
- Build the two-stage fingerprint matcher (TF-IDF candidates → BERT pairwise) for Java/JS/Python/Go.
- Implement attack-graph chaining (k-shortest path) with EPSS×CVSS×centrality×reachability scoring.
- Add the LLM-based SAST false-positive filter.
- Gate: customer reports "found a path competitors missed" or "explained a CVE with the exact commit" within 30 days post-deploy.
Goal: production hardening + on-prem learning loop.
- Verifier agent on every LLM-asserted graph mutation; track hallucination metrics.
- Sandboxed PoC against customer digital twins (opt-in).
- Memgraph/NebulaGraph migration option for >10M-node tenants.
- Per-tenant feedback RAG (accept/reject) closes the learning loop.
Replan triggers:
- Correlation precision < 90% in Stage 3 → delay Stage 4; invest in declarative signal coverage (enforce OCI labels via admission control).
- Patch acceptance < 30% in Stage 2 → deepen the Verifier sandbox before scaling.
- Any scan-related production incident → halt ACTIVE_SAFE, revert to PASSIVE until root cause shipped.
- Graph query latency > 5s at customer scale → subgraph caching or Memgraph migration before adding enrichment.
synosec/
├── README.md
├── deploy/
│ ├── helm/ # single air-gapped Helm chart
│ └── docker-compose.dev.yml # local single-box (Ollama, Neo4j, Qdrant, PG)
├── proto/ # shared gRPC/Protobuf contracts
├── services/
│ ├── discovery-engine/ # Naabu/Subfinder/Httpx/Katana/BloodHound wrappers
│ ├── vuln-intel/ # Nuclei/Grype/Trivy + local NVD/EPSS/KEV mirror
│ ├── sast-pipeline/ # Semgrep/CodeQL/Joern + reachability fusion
│ ├── llm-orchestrator/ # vLLM client, patch gen, RAG, patch verifier
│ ├── correlation-engine/ # signal collectors, fusion scorer, verifiers
│ ├── orchestration/ # LangGraph planner/executor/verifier graphs
│ ├── reporting-ui/ # React/Next.js + GraphQL gateway + PR bot
│ └── common/ # auth (mTLS), audit ledger, engagement scope
├── data/
│ ├── neo4j-schema/ # constraints, indexes, GDS projections
│ ├── postgres-migrations/
│ └── qdrant-collections/
├── agents/
│ ├── planner/ executor/ verifier/ # prompts (JSON-schema), tool allowlists
└── test/
├── synthetic-lab/ # 1,000-asset env, 6-VM PoC, digital twins
└── corpora/ # patch corpus, OWASP Benchmark harness
Single-box dev (docker-compose.dev.yml): Ollama (coder + reasoner), Neo4j Community, Qdrant, PostgreSQL+pgvector, MinIO, Redpanda. Discovery tools as sidecar containers. Bring up:
docker compose -f deploy/docker-compose.dev.yml up -d
make seed-synthetic-lab # spins up the 1,000-asset test environment
make run-engagement SCOPE=test/synthetic-lab/scope.json
Production (air-gapped): helm install synosec deploy/helm -f values.airgap.yaml. All model weights, vuln DBs (NVD/GHSA/OSV/EPSS/KEV), and container images are mirrored into the customer registry beforehand. No outbound network required at runtime.
GPU sizing: reasoner (70B quantized) ≈ 1–2× A100/H100; coder (32B) ≈ 1× A100; embeddings on CPU or shared GPU. Verifier uses the smaller/cheaper model.
- Unit + contract tests per service against the Protobuf schemas.
- Synthetic lab (test/synthetic-lab/): a controlled 1,000-asset environment with known vulnerabilities and a known repo↔image mapping, used to measure correlation precision/recall.
- OWASP Benchmark harness: measures SAST + LLM-filter false-positive reduction.
- Patch corpus: measures Verifier pass rate and developer acceptance.
- Posture tests: assert that out-of-scope targets are refused, rate ceilings hold, and ACTIVE_SAFE never fires exploitation templates.
- Hallucination metric: sampled LLM-as-judge spot-check on graph edges and patches; tracked as a top-line product metric.
Definition of done per stage = the stage gate in Section 11.
| Term | Meaning |
|---|---|
| Attack graph | Graph of how chained vulnerabilities let an attacker move from an entry point to a target |
| AUTO_LINK | Correlation confidence ≥ 0.95 with a cryptographic signal; safe to patch automatically (still HITL-approved) |
| Correlation Engine | The service that links an external vulnerability to the exact source code |
| HITL | Human-in-the-loop — required human approval before state-changing actions |
| Non-exploitative validation | Confirming exploit feasibility without firing live payloads at production |
| Provenance (SLSA/in-toto) | Cryptographic attestation binding a built artifact to its source commit and builder |
| ReAct | Reasoning + Acting agent loop (think → call tool → observe → repeat) |
| Reachability | Whether a vulnerable code symbol is actually callable from a public entry point |
| Verifier agent | The component that re-checks every LLM-asserted fact deterministically before it enters the graph |
| Zero-egress | No data leaves the customer perimeter; all inference and storage on-prem |
This README is the authoritative internal build specification. Build deterministic before learned, validate before exploit, and never let an unverified LLM assertion become a committed fact.