SynoSec — AI-Driven Continuous Attack Surface Validation & Remediation Platform

An end-to-end, self-hosted security platform that discovers external infrastructure, builds a live attack graph, correlates each external vulnerability back to the exact source code that deployed it, and generates verified, human-approved code fixes.

The differentiator is the Correlation Engine: where other tools stop at "patch this CVE on this host," SynoSec closes the loop to UserController.java:42 in commit a1b2c3d, with cryptographic evidence binding the running container to that code.

Table of Contents

1. What This Tool Does
2. System Overview & Mental Model
3. Architecture: The Nine Services
4. Data Model (Graph + Relational + Vector)
5. End-to-End Data Flow
6. The Correlation Engine (Core Differentiator)
7. The Agentic Orchestration Layer
8. Technology Stack
9. Inter-Service Contracts (APIs & Events)
10. Security, Posture & Safety Controls
11. Development Stages (Build Plan)
12. Repository Layout
13. Local Development & Deployment
14. Testing & Acceptance Gates
15. Glossary

1. What This Tool Does

SynoSec operates across two domains and bridges them:

External Domain (Infrastructure & Network Mapping)

Autonomous discovery — an AI-guided scanner traverses authorized infrastructure, identifying systems, running services, software, and technologies.
Dynamic graph mapping — as nodes are discovered, the system evaluates them for further connections, building a continuous live topology (nodes and edges).
Vulnerability detection — identifies CVEs, misconfigurations, and weaknesses for each discovered component.
Attack path chaining — analyzes how isolated vulnerabilities chain across connected nodes to reach deeper systems, producing an attack graph.

Internal Domain (Codebase & Remediation)

Static code analysis (SAST) — scans the internal codebase for vulnerabilities.
LLM-driven remediation — combines SAST findings with LLMs to generate code-level fixes.
Human-in-the-loop (HITL) — surfaces fixes via a UI/PR workflow for approval or rejection.

The Bridge (the reason this tool exists)

The Correlation Engine links an external attack path to the exact repository, commit, file, and function that produced the vulnerable running service — then triggers a verified patch against that code.

Posture: SynoSec performs continuous, non-exploitative validation. It confirms exploit feasibility (version match + reachability + optional sandboxed proof-of-concept against a digital twin) rather than firing live payloads against production.

Deployment constraint: Zero-egress, fully self-hosted. No prompts, code, scan data, or findings ever leave the customer perimeter. All LLM inference runs on-prem.

2. System Overview & Mental Model

The simplest way to reason about SynoSec:

"Outside-in discovery feeds a graph. The graph plus code feeds a correlation. The correlation feeds a verified patch. A human approves the patch."

DISCOVER  ──►  GRAPH  ──►  CORRELATE  ──►  REMEDIATE  ──►  APPROVE
(external)    (Neo4j)     (Correlation     (LLM patch)    (HITL PR)
                           Engine)

Everything is orchestrated by a LangGraph state machine running a ReAct reasoning agent plus a Verifier agent that gates every fact before it enters the graph. Tools (scanners, SAST engines, SBOM tools) are deterministic; the LLM only plans and explains — it never commits an unverified fact.

3. Architecture: The Nine Services

All services run as independently deployable containers, communicate over an internal gRPC/Protobuf mesh (mTLS), and emit events to a Kafka/Redpanda bus. The entire platform ships as a single Helm chart for air-gapped operation.

                     ┌────────────────────────────────────────────────┐
                     │              Orchestration Plane               │
                     │  ┌────────────────┐  ┌─────────────────────┐   │
                     │  │ LangGraph ReAct│  │  Verifier Agent     │   │
                     │  │   Coordinator  │◀▶│ (hallucination gate)│   │
                     │  └───────┬────────┘  └─────────────────────┘   │
                     └──────────┼─────────────────────────────────────┘
                                │
   ┌────────────────────────────┼────────────────────────────────────┐
   │                            ▼                                    │
┌──┴────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ ┌┴─────────────┐
│ Discovery     │ │ Vulnerability        │ │ SAST Pipeline        │ │ LLM          │
│ Engine        │ │ Intelligence         │ │ (Semgrep + CodeQL)   │ │ Orchestrator │
│               │ │                      │ │                      │ │ + Patch Gen  │
└───────┬───────┘ └──────────┬───────────┘ └──────────┬───────────┘ └──────┬───────┘
        │                    │                        │                    │
        ▼                    ▼                        ▼                    ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                            CORRELATION ENGINE                                  │
│   Deterministic Joiner │ Fingerprint Matcher │ Confidence Scorer │ Attestation │
│                                                                    Verifier    │
└────────────────────────────────┬───────────────────────────────────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        ▼                        ▼                        ▼
┌───────────────┐      ┌──────────────────┐      ┌───────────────────┐
│  Neo4j        │      │ Qdrant (RAG)     │      │ PostgreSQL +      │
│ (Topology +   │      │ + Code Embeddings│      │ pgvector          │
│  Attack Graph)│      │                  │      │ (Findings, Audit) │
└───────────────┘      └──────────────────┘      └───────────────────┘
        │
        ▼
┌────────────────────────────────────────────┐
│  Reporting + HITL UI (React) + PR Bot      │
└────────────────────────────────────────────┘

3.1 Discovery Engine

Responsibility: Continuously enumerate the authorized external attack surface and internal assets.
Implements: subfinder → dnsx → naabu → httpx → katana pipeline; BloodHound CE for Active Directory; read-only cloud inventory collectors (CloudQuery/Steampipe).
Emits: AssetDiscovered events on discovery.events.
Constraint: Every tool invocation is checked against a signed Engagement scope object (CIDR + domain allowlist, rate ceilings, blackout windows, active/passive mode). Out-of-scope calls are refused at the orchestration layer.
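
The scope check described in the constraint can be sketched as follows. This is a minimal illustration, not the actual Engagement implementation: the `Scope` class, its field names, and the `authorize` function are hypothetical, signature verification of the Engagement object is assumed to have happened already, and blackout windows that cross midnight are not handled.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, time, timezone


@dataclass(frozen=True)
class Scope:
    """Hypothetical in-memory view of an already-verified Engagement scope."""
    cidrs: tuple[str, ...]
    domains: tuple[str, ...]
    blackout: tuple[tuple[time, time], ...] = ()  # UTC (start, end) windows
    mode: str = "PASSIVE"  # or "ACTIVE_SAFE"


def host_in_scope(scope: Scope, host: str) -> bool:
    """True if host is an allowlisted IP or a (sub)domain of an allowlisted domain."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treat it as a DNS name.
        name = host.lower().rstrip(".")
        return any(name == d or name.endswith("." + d) for d in scope.domains)
    return any(ip in ipaddress.ip_network(c) for c in scope.cidrs)


def in_blackout(scope: Scope, now: datetime) -> bool:
    """True if the (timezone-aware) instant falls inside any blackout window."""
    t = now.astimezone(timezone.utc).time()
    return any(start <= t < end for start, end in scope.blackout)


def authorize(scope: Scope, host: str, active: bool, now: datetime) -> bool:
    """Refuse out-of-scope hosts, blackout windows, and active probes in PASSIVE mode."""
    if not host_in_scope(scope, host) or in_blackout(scope, now):
        return False
    return not active or scope.mode == "ACTIVE_SAFE"
```

Matching on a leading dot (`"." + d`) rather than a bare suffix keeps a look-alike such as `evilexample.com` from passing an `example.com` allowlist entry.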

3.2 Vulnerability Intelligence Service

Responsibility: Match discovered components to CVEs, KEV, EPSS, and misconfiguration templates.
Implements: Nuclei (templates), Grype (CVE matching from SBOM), Trivy (OS/IaC/secrets/K8s breadth). Maintains a local mirror of NVD, GHSA, OSV, the daily EPSS snapshot, and CISA KEV (zero-egress).
Key rule: Version-aware pre-filter. Match CVEs against the actual deployed software version before writing to the graph, which eliminates the bulk of version-irrelevant CVE noise.
Emits: VulnerabilityMatched events.
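
A minimal sketch of the version-aware pre-filter, assuming plain dotted numeric versions and half-open affected ranges. The function names and the range representation are hypothetical; real feeds (NVD, OSV) use richer range formats that Grype and Trivy already handle. The ranges shown are the supported Spring Framework branches listed in the public CVE-2022-22965 advisory.

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '5.3.17' (no pre-release tags)."""
    return tuple(int(part) for part in v.split("."))


def is_affected(deployed: str, ranges: list[tuple[str, str]]) -> bool:
    """ranges holds (first_affected, first_fixed) pairs; the fixed bound is exclusive."""
    d = parse_version(deployed)
    return any(parse_version(lo) <= d < parse_version(hi) for lo, hi in ranges)


# Spring Framework 5.3.0 to 5.3.17 and 5.2.0 to 5.2.19 are affected;
# 5.3.18 and 5.2.20 carry the fix.
SPRING4SHELL = [("5.3.0", "5.3.18"), ("5.2.0", "5.2.20")]

print(is_affected("5.3.17", SPRING4SHELL))  # deployed version inside the range
print(is_affected("5.3.23", SPRING4SHELL))  # patched version, filtered out
```

Only components for which `is_affected` is true would produce a HAS_CVE edge; the rest never reach the graph.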

3.3 SAST Pipeline

Responsibility: Scan internal repositories at PR time and nightly.
Implements: Semgrep (fast PR feedback) + CodeQL (deep nightly taint analysis) + Joern (high-precision specialty pass). Reachability layer fuses CodeQL call graph + SVF (C/C++) + SootUp (Java) to drop dead-code findings.
Emits: CodeFinding{repo, commit, file, line, rule_id, sink_kind, taint_sources, reachability_status, confidence}.

3.4 LLM Orchestrator + Patch Generator

Responsibility: Turn correlated findings into developer-quality patches and generate reports.
Implements: Self-hosted vLLM serving a coder model (e.g., Qwen2.5-Coder-32B) and a reasoner model (e.g., Llama 3.3 70B). RAG context from Qdrant.
Patch verification (pre-human): Every generated patch is re-run through Semgrep + CodeQL diff query + project unit tests in an ephemeral sandbox before any human sees it. Failed patches are dropped and logged as RAG-negative examples.

3.5 Correlation Engine

The centerpiece — see Section 6.

3.6 Graph Database (Neo4j)

Stores live topology, attack graph, and external↔internal correlations as a property graph. GDS library computes centrality and shortest paths.

3.7 Vector Store (Qdrant)

RAG index for code chunks, framework docs, historical accepted patches, CVE descriptions. Embeddings served on-prem via Text-Embeddings-Inference (bge-large-en-v1.5 or nomic-embed-text).

3.8 Findings Store (PostgreSQL + pgvector)

Authoritative ACID system of record for findings, audit trails, HITL decisions, patch history. pgvector column used for semantic dedup ("is this the same finding as last week?").

3.9 Reporting + HITL UI

React/Next.js frontend over a GraphQL gateway. A GitHub/GitLab/Bitbucket bot opens PRs containing: the attack path, the correlation evidence, the SAST rule, the diff, and sandbox test results — with one-click approve/reject. Rejections flow back into the patch-history RAG index.

4. Data Model (Graph + Relational + Vector)

4.1 Neo4j property graph

Node labels: Asset, Service, Port, Software, CVE, Misconfiguration, Repository, Commit, File, Function, Image, Pipeline, Identity, Finding, Patch, AttackPathStep.

Core relationships:

Edge	From → To	Meaning
`RUNS`	Asset → Service	Asset hosts a service
`LISTENS_ON`	Service → Port	Service exposes a port
`CONNECTS_TO`	Asset → Asset	Network reachability
`HAS_CVE`	Software → CVE	Component is vulnerable
`BUILT_FROM`	Image → Commit	The key external↔internal edge
`DEFINED_IN`	Service → Repository	Service's source repo
`DEPLOYED_BY`	Asset → Pipeline	CI/CD that deployed it
`CHAINS_TO`	AttackPathStep → AttackPathStep	Vulnerability chaining
`FIXED_BY`	Finding → Patch	Remediation link

Enrichment properties on attack-path nodes (used for prioritization): cvss, epss, kev (bool), node_centrality, reachability_from_entry, asset_criticality.

4.2 PostgreSQL (system of record)

findings(id, tenant_id, type, severity, status, asset_ref, repo_ref,
         commit_sha, file_path, line, rule_id, confidence,
         correlation_evidence JSONB, created_at, dedup_vector vector(1024))

hitl_decisions(id, finding_id, decision, reason, actor, decided_at)

patch_history(id, finding_id, diff, verifier_result JSONB, accepted BOOL,
              feedback_text, created_at)

audit_ledger(id, prev_hash, payload JSONB, hash, created_at)  -- append-only, hash-chained

engagements(id, tenant_id, scope JSONB, rate_limits JSONB,
            blackout_windows JSONB, mode, signature, signed_by)
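
The audit_ledger table is described as append-only and hash-chained. A minimal sketch of that chaining follows, assuming SHA-256 over the previous hash concatenated with a canonical JSON encoding of the payload; the hash function, the encoding, and the all-zero genesis value are assumptions, since the schema only names the columns.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first row


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list[dict], payload: dict) -> dict:
    """Append a row whose prev_hash points at the current tail."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    row = {"prev_hash": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    ledger.append(row)
    return row


def verify(ledger: list[dict]) -> bool:
    """Recompute every link; an edited, reordered, or removed interior row fails."""
    prev = GENESIS
    for row in ledger:
        if row["prev_hash"] != prev or row["hash"] != entry_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True
```

Chaining alone does not detect truncation of the newest rows, so a deployment would also need to anchor the latest hash somewhere outside the table.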

4.3 Qdrant collections

code_embeddings, patch_history, cve_descriptions, framework_docs — each with payload filters (tenant_id, repo, language) for hybrid filter+vector search.

5. End-to-End Data Flow

Engagement (signed scope)
    │
    ▼
LangGraph Coordinator ── ReAct loop ──► Tool nodes (Nmap/Nuclei/Semgrep/...)
    │                        │                      │
    │                        ▼                      ▼
    │                  Verifier node ◀──────── Tool output
    │                        │
    │                        ▼ (only verified facts)
    │                     Neo4j upserts
    │
    ▼
Correlation Engine fuses signals ──► Confidence-scored matches ──► PostgreSQL findings
    │
    ▼
For each AUTO_LINK finding:
    LLM Orchestrator ──► RAG retrieve (Qdrant) ──► Patch draft
                              │
                              ▼
                       Patch Verifier (Semgrep+CodeQL+tests, sandbox)
                              │
                              ▼
                       HITL PR Bot ──► developer accept/reject
                              │
                              ▼
                       Outcome feedback ──► Qdrant patch_history (RAG)

Flow, narrated:

Discovery → Graph. Discovery Engine emits AssetDiscovered. The Verifier confirms each fact (re-grab banner, re-resolve DNS) before the Coordinator upserts nodes into Neo4j. Unverified LLM-asserted edges are quarantined, never committed.
Vuln Intel → Graph. CVEs/misconfigs matched (version-filtered) and written as HAS_CVE edges with cvss/epss/kev.
Chaining → Attack Graph. k-shortest-path computation from declared entry points (internet, partner, insider) to crown-jewel assets, scored by CVSS × EPSS × (1 + reachability_bonus) / mitigation_score. Never enumerate all paths (path-explosion defense).
Correlation. For each top-ranked vulnerable service, the Correlation Engine fuses signals to find the exact repo/commit/file (see §6).
Remediation. For AUTO_LINK findings, the LLM Orchestrator generates a patch with RAG context; the Patch Verifier re-checks it in a sandbox.
Approval. The PR bot surfaces evidence + diff; the human approves or rejects; the outcome feeds back into the RAG index (the only "learning" loop in a zero-egress deployment).
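
The chaining score from the flow above can be worked through numerically. This sketch applies the stated formula to two candidates; the candidate names, the bonus and mitigation values, and the choice to score a whole path with one set of inputs are illustrative assumptions.

```python
def path_score(cvss, epss, reachability_bonus=0.0, mitigation_score=1.0):
    """CVSS x EPSS x (1 + reachability_bonus) / mitigation_score."""
    if mitigation_score <= 0:
        raise ValueError("mitigation_score must be positive")
    return cvss * epss * (1 + reachability_bonus) / mitigation_score


# Hypothetical candidates: (label, cvss, epss, reachability_bonus, mitigation_score)
candidates = [
    ("internal-medium-behind-waf", 6.5, 0.10, 0.0, 2.0),
    ("spring4shell-via-proxy", 9.8, 0.97, 0.5, 1.0),
]

# Highest score first: this is the order in which paths go to correlation.
ranked = sorted(candidates, key=lambda c: path_score(*c[1:]), reverse=True)
for label, *inputs in ranked:
    print(label, round(path_score(*inputs), 3))
```

With these inputs the internet-reachable Spring4Shell path scores 9.8 × 0.97 × 1.5 ≈ 14.26, while the mitigated internal finding scores 6.5 × 0.10 / 2 = 0.325, so the former is correlated first.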

6. The Correlation Engine (Core Differentiator)

The Correlation Engine answers one question with high accuracy and low false-positive rate:

"This vulnerable running service — which exact line of source code, in which commit, produced it?"

It is a confidence-scored, multi-signal join, not a single ML model. Signals are layered from deterministic (cryptographic) down to probabilistic (learned).

6.1 The Signal Stack

Tier	Signal	Trust	How obtained
1. Cryptographic	SLSA in-toto provenance verified via Sigstore/Cosign	1.00	`cosign verify-attestation`; `slsa-verifier verify-image`
2. Declarative	OCI labels: `org.opencontainers.image.source`, `.revision`	0.85	`docker buildx imagetools inspect --format '{{json .Image}}'` (labels live in the image config, not the raw manifest)
3. Declarative	SBOM (Syft) with package PURLs and source hints	0.80	`syft <image> -o cyclonedx-json`
4. Pipeline	CI job ID / run URL stored as image label	0.75	Parsed from build-time labels
5. Deployment	K8s annotations, ArgoCD app source, Helm values	0.70	K8s API + ArgoCD CRD watch
6. IaC	Terraform state → module → resource → image tag	0.65	Terraform state parsing
7. Network	Banner, JARM, HTTP tech fingerprint, favicon hash	0.40–0.65	`httpx` + Nuclei tech-detect
8. Learned	Source structure ↔ runtime endpoint (TF-IDF candidates → BERT pairwise classifier)	0.30–0.85	Two-stage retrieval + classification
9. AST/symbol	"Hot-spot" unique class/route names matched to deployed bytecode/JS	0.50–0.80	tree-sitter AST extraction, JS de-bundling, JVM class extraction

6.2 Confidence Scoring Algorithm

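Each signal i yields a pair (weight_i, score_i), both in [0,1], and names the repository it points at. Fuse with noisy-OR, p_match = 1 − ∏(1 − weight_i · score_i), then apply a contradiction penalty:

def contradiction_penalty(signals):
    # Drops to 0.5 if two signals point at different repos beyond
    # tolerance → emit a "provenance conflict" finding, NEVER silently
    # pick one. Tolerance handling is elided: any disagreement counts here.
    repos = {repo for _w, _s, repo in signals}
    return 0.5 if len(repos) > 1 else 1.0

def fuse(signals):
    """signals: iterable of (weight, score, repo) triples (repo field assumed)."""
    signals = list(signals)
    p_miss = 1.0  # probability that every signal misses
    for w, s, _repo in signals:
        p_miss *= 1 - w * s
    p_match = 1 - p_miss  # noisy-OR
    return p_match * contradiction_penalty(signals)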

Decision thresholds:

Band	Condition	Action
AUTO_LINK	`adjusted ≥ 0.95` and ≥1 Tier-1 signal	Used directly in patch generation
PROPOSED_LINK	`0.70 ≤ adjusted < 0.95`	One-click human confirm
CANDIDATE	`adjusted < 0.70`	Graph annotation only; never patched

Hard rule: The LLM patch generator never operates on a correlation below AUTO_LINK without explicit HITL approval. This single gate prevents most "patched the wrong file" errors.
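
The banding rule can be sketched as a small function. The table leaves one case unstated: a score of 0.95 or more with no Tier-1 signal. This sketch assumes that case falls to PROPOSED_LINK, which is an interpretation rather than something the thresholds spell out; the `Band` enum and `classify` are illustrative names.

```python
from enum import Enum


class Band(Enum):
    AUTO_LINK = "AUTO_LINK"
    PROPOSED_LINK = "PROPOSED_LINK"
    CANDIDATE = "CANDIDATE"


def classify(adjusted: float, has_tier1: bool) -> Band:
    """Map a fused, penalty-adjusted confidence to a decision band."""
    if adjusted >= 0.95 and has_tier1:
        return Band.AUTO_LINK
    if adjusted >= 0.70:
        # Assumption: high scores without cryptographic evidence still
        # need a human confirm, so they land here rather than in AUTO_LINK.
        return Band.PROPOSED_LINK
    return Band.CANDIDATE


def patchable_without_confirm(band: Band) -> bool:
    """Only AUTO_LINK may feed patch generation without an extra HITL confirm."""
    return band is Band.AUTO_LINK
```

The Tier-1 condition matters in practice: three declarative signals at full score (weights 0.85, 0.80, 0.75) fuse to 1 − 0.15 × 0.20 × 0.25 = 0.9925, which clears the numeric threshold yet carries no cryptographic binding, so `classify(0.9925, has_tier1=False)` stays at PROPOSED_LINK.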

6.3 Worked Example: Attack Path → Exact Code

Discovery: api.example.com:443 runs Spring Boot 2.6.5 (which bundles Spring Framework 5.3.17). Nuclei flags CVE-2022-22965 (Spring4Shell, CVSS 9.8, EPSS 0.97, KEV).
Chaining: Verifier confirms the reverse proxy actually proxies the vulnerable endpoint (non-exploitative HEAD/OPTIONS probe). Centrality + EPSS×CVSS re-rank this path to the top.
Correlation:
- K8s API → pod image digest sha256:abc….
- cosign verify-attestation → SLSA provenance: source github.com/octo/spring-app, commit a1b2c3d. Tier-1, score 1.00.
- docker buildx imagetools inspect → org.opencontainers.image.revision=a1b2c3d. Tier-2 corroborates.
- Syft SBOM → spring-webmvc@5.3.17 (inside the affected 5.3.0–5.3.17 range; 5.3.18 carries the fix). Tier-3 corroborates.
- Fusion → 1.0 → AUTO_LINK.
Code anchoring: Correlation Engine asks SAST: "in octo/spring-app@a1b2c3d, find public Spring controllers using the vulnerable DataBinder pattern." CodeQL returns src/main/java/com/octo/web/UserController.java:42; reachability confirms it's reachable from DispatcherServlet.
Patch: Coder LLM gets the snippet + CodeQL alert + Spring advisory (RAG) + 3 historical accepted fixes. Produces a diff (@InitBinder with setDisallowedFields(...) or dependency upgrade). Verifier re-runs Semgrep+CodeQL+tests in sandbox — all pass.
PR: Bot opens "Fix CVE-2022-22965 in UserController (reachable from internet via api.example.com)" with attack-path SVG, 4-signal correlation evidence (confidence 1.0), diff, test results, approve/reject.

Result: a perimeter Spring4Shell CVE resolved with a verified patch against an exact line, with cryptographic evidence binding the container to that code.

6.4 Anti-False-Attribution Verification Layers

Provenance conflict detector — disagreeing Tier-1/2 signals → surface as a supply-chain finding, don't auto-link.
Sigstore Rekor cross-check — verify inclusion proof offline against a pinned trusted root; fail closed if missing.
Signed-SBOM requirement — trust an SBOM only if attached as a signed in-toto attestation.
Reachability gate — no patch unless the vulnerable symbol is reachable from a public entry point.
Verifier agent on every LLM-asserted edge — re-validate with a deterministic tool call before committing to Neo4j.

7. The Agentic Orchestration Layer

Framework: LangGraph (state-machine + durable execution + checkpointing).

Design rules (security-critical):

Planner/Executor split. The planner LLM has no tool access. The executor LLM has tools but is constrained to JSON I/O and an enforced allowlist. (Agent isolation is the single strongest prompt-injection defense.)
JSON-structured tool I/O. All external content (banners, scraped HTML, CVE text) is wrapped in delimiters and never concatenated into the system-prompt zone.
Bounded budget. Max tool calls and max wall-clock per goal (e.g., 50 calls / 20 min) prevents runaway agents.
Tool allowlist at the graph level. The LLM cannot invoke an out-of-scope tool regardless of what it generates.
Checkpoint every transition to Postgres so interrupted scans resume deterministically.
No silent state-mutating tool calls. Any tool that opens a PR, writes a file, or changes a firewall rule requires HITL approval.
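
Two of the rules above, the graph-level tool allowlist and the bounded budget, can be enforced by a guard that sits between the executor and every tool node. The sketch below is a hypothetical stand-alone class, not LangGraph API; the defaults mirror the 50 calls / 20 min example, and the clock is injectable so the guard can be tested without waiting.

```python
import time


class ToolNotAllowed(RuntimeError):
    """Raised when the model asks for a tool outside the allowlist."""


class BudgetExceeded(RuntimeError):
    """Raised when the call-count or wall-clock budget is spent."""


class ToolGate:
    """Hypothetical guard that every tool invocation must pass through."""

    def __init__(self, allowlist, max_calls=50, max_seconds=20 * 60,
                 clock=time.monotonic):
        self.allowlist = frozenset(allowlist)
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.calls = 0

    def check(self, tool: str) -> None:
        """Raise instead of running the tool; count the call only if it is admitted."""
        if tool not in self.allowlist:
            raise ToolNotAllowed(tool)
        if self.calls >= self.max_calls:
            raise BudgetExceeded("tool-call budget exhausted")
        if self.clock() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exhausted")
        self.calls += 1
```

Because the check runs in ordinary code outside the model, nothing the LLM generates can widen the allowlist or reset the budget.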

The Verifier agent (separate, smaller/cheaper model or deterministic rules) re-checks every claim before it becomes a graph fact:

Numeric/factual claim → must invoke a tool to confirm (re-probe banner, re-resolve DNS, re-query inventory).
Low-confidence edges → quarantine queue, never the live graph.

8. Technology Stack

Discovery & Vulnerability Scanning

Need	Choice	Notes
Port scanning	Naabu (+ Nmap second pass for OS fingerprint)	Default rate capped low (e.g., 200 pps) for safety
Subdomain enum	Subfinder (+ periodic Amass)	Passive-first
HTTP fingerprint	Httpx	Tech detection
Web crawling	Katana	Security-recon crawler
Template vuln scan	Nuclei (+ OWASP ZAP for authenticated DAST)	9,000+ community templates
AD/identity	BloodHound CE (Neo4j-backed)	AD attack paths
Cloud inventory	CloudQuery + Steampipe (read-only IAM)	Normalizes to Postgres

SAST & Reachability

Need	Choice	Notes
PR-time SAST	Semgrep	Fast; filter aggressively
Deep nightly SAST	CodeQL	Deepest data-flow analysis; queries are open source but the engine is not, so confirm license terms before scanning private code
High-precision pass	Joern	Low FPR specialty
Reachability	CodeQL + SVF (C/C++) + SootUp (Java)	Multi-tool fusion
FP filter (Stage 4)	LLM agent over SAST output	Intended to cut SAST false positives; effect measured with the OWASP Benchmark harness (Section 14)

SBOM & Container Provenance

Need	Choice
SBOM generation	Syft (+ Trivy for breadth)
Vuln matching	Grype + Trivy (different DB coverage)
Image signing/attestation	Cosign + Sigstore (offline verify for air-gap)
SLSA verify	slsa-verifier
Continuous SBOM monitoring	Dependency-Track

AI / Agents / Data

Need	Choice	Notes
Orchestration	LangGraph	State machine + checkpointing
LLM serving	vLLM (prod), Ollama (dev/single-box)	Air-gap friendly
Reasoner LLM	Llama 3.3 70B or DeepSeek-V3 (quantized)	Open-weight
Coder LLM	Qwen2.5-Coder-32B or DeepSeek-Coder-V2	Open-weight
Embeddings	bge-large-en-v1.5 / nomic-embed-text	Self-hosted via TEI
Vector DB	Qdrant	Single binary, payload filters
Graph DB	Neo4j Community (Memgraph/NebulaGraph swap >10M nodes)	GDS library
Findings DB	PostgreSQL + pgvector	ACID + dedup
Blob store	MinIO (S3-compatible)	Immutable scan output, SBOMs
Bus	Kafka / Redpanda	Event backbone
Metrics	Prometheus + Grafana

Attack-Graph References (study/optional, not vendored)

MulVAL (logical attack graph, polynomial), CAULDRON/TVA (exploit-dependency model), MITRE CALDERA (ATT&CK emulation plug-in), PentAGI / CAI / Strix (open-source agent references — none of which correlate to source code, confirming the gap SynoSec fills).

9. Inter-Service Contracts (APIs & Events)

All services expose gRPC (Protobuf) and publish/consume Kafka events. Representative schemas:

Event: AssetDiscovered (topic discovery.events)

{
  "asset_id": "uuid", "tenant_id": "uuid",
  "ip": "203.0.113.10", "port": 443, "proto": "https",
  "service": "spring-boot", "version": "2.7.4",
  "banner": "...", "tls_jarm": "...", "fingerprint": {"...": "..."},
  "hashes": {"favicon": "..."}, "source_tool": "httpx",
  "timestamp": "2026-06-08T12:00:00Z", "confidence": 0.92
}

Event: VulnerabilityMatched (topic vuln.events)

{
  "asset_id": "uuid", "cve_id": "CVE-2022-22965",
  "cvss": 9.8, "epss": 0.97, "kev": true,
  "version_match": true, "source_tool": "nuclei"
}

Event: CodeFinding (topic sast.events)

{
  "repo": "github.com/octo/spring-app", "commit": "a1b2c3d",
  "file": "src/main/java/com/octo/web/UserController.java", "line": 42,
  "rule_id": "java/spring-disallowed-fields", "sink_kind": "data-binder",
  "taint_sources": ["request.param"], "reachability_status": "reachable",
  "confidence": 0.91
}

gRPC: CorrelationService.Correlate

rpc Correlate(CorrelateRequest) returns (CorrelateResponse);

message CorrelateRequest { string asset_id = 1; string image_digest = 2; }
message CorrelateResponse {
  string repo = 1; string commit = 2;
  float confidence = 3;
  Band band = 4;                    // AUTO_LINK | PROPOSED_LINK | CANDIDATE
  repeated Signal signals = 5;      // evidence trail
  bool provenance_conflict = 6;
}

gRPC: RemediationService.GeneratePatch → returns unified diff + justification + self-grade + verifier result.

10. Security, Posture & Safety Controls

Non-disruptive scanner operation

Signed Engagement object declares scope, per-tool rate ceilings, blackout windows, and active/passive mode.
Two modes: PASSIVE (banner/DNS/cert-transparency/SBOM only) and ACTIVE_SAFE (port scan with adaptive backoff, Nuclei filtered to safe=true, no exploitation templates).
Rate limiting: per-target token bucket; global circuit breaker halves rate if 5xx exceeds threshold.
Validation, not exploitation: version match + reachability + optional sandboxed PoC against a customer-provided digital twin — never against production.
Change-window/freeze flags honored from the customer CMDB.
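
The rate-limiting bullet combines a token bucket with a circuit breaker that halves the rate. A minimal single-process sketch follows. For brevity it folds the global breaker into the same per-target bucket, and the 5% error threshold is a made-up value; the Redis-backed distributed variant mentioned under Scalability is not shown.

```python
import time


class TokenBucket:
    """Per-target probe limiter with a rate-halving breaker hook."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens (probes) refilled per second
        self.capacity = capacity  # burst ceiling
        self.tokens = capacity
        self.clock = clock        # injectable for testing
        self.last = clock()

    def try_acquire(self) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def on_error_ratio(self, ratio_5xx: float, threshold: float = 0.05) -> None:
        """Breaker hook: halve the refill rate when 5xx responses exceed the threshold."""
        if ratio_5xx > threshold:
            self.rate /= 2
```

A scanner wrapper would call `try_acquire()` before each probe and skip or delay the probe on `False`, and would feed the observed 5xx ratio into `on_error_ratio` after each batch.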

LLM security

Zero-egress: all inference on-prem; nothing leaves the perimeter.
Prompt-injection defense-in-depth: planner/executor isolation, JSON-structured I/O, an input prompt-injection classifier, an outbound egress allowlist on agent containers, and no silent state-mutating tool calls.
Sandboxing: every tool runs in an ephemeral container — no host mounts, no cloud credentials, gVisor/Kata kernel isolation, egress restricted to engagement scope.
Audit trail: every prompt, tool call, output, graph mutation, and HITL decision hash-chained into an append-only Postgres ledger.
Data retention: per-tenant AES-GCM at rest (Vault/KMS). Defaults: scan data 90 days, findings indefinite, LLM I/O 30 days (configurable to ephemeral).

HITL control tiers

Tier	Scope	Behavior
0	Read-only PASSIVE scans, allowlisted assets	Auto-execute
1	ACTIVE_SAFE scans, SAST, draft patch generation	Auto-execute + audit
2	PR creation, sub-AUTO_LINK edge commits, sandboxed PoC	Approve-before-execute
3	Scope changes, tool-allowlist changes, model upgrades	Two-person approval
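
The tier table can be read as a lookup from action to required approvals. In this sketch the action names are hypothetical labels for the scopes in the table, and treating unknown actions as Tier 3 is an assumed fail-closed default rather than a stated rule.

```python
from enum import IntEnum


class Tier(IntEnum):
    AUTO = 0        # auto-execute
    AUTO_AUDIT = 1  # auto-execute + audit
    APPROVE = 2     # approve-before-execute
    TWO_PERSON = 3  # two-person approval


# Hypothetical action labels mapped to the tiers from the table.
ACTION_TIER = {
    "passive_scan": Tier.AUTO,
    "active_safe_scan": Tier.AUTO_AUDIT,
    "sast_scan": Tier.AUTO_AUDIT,
    "draft_patch": Tier.AUTO_AUDIT,
    "open_pr": Tier.APPROVE,
    "commit_sub_auto_link_edge": Tier.APPROVE,
    "sandboxed_poc": Tier.APPROVE,
    "change_scope": Tier.TWO_PERSON,
    "change_tool_allowlist": Tier.TWO_PERSON,
    "upgrade_model": Tier.TWO_PERSON,
}

REQUIRED_APPROVERS = {Tier.AUTO: 0, Tier.AUTO_AUDIT: 0, Tier.APPROVE: 1, Tier.TWO_PERSON: 2}


def may_execute(action: str, approvers: set[str]) -> bool:
    """Unknown actions fail closed by defaulting to the strictest tier."""
    tier = ACTION_TIER.get(action, Tier.TWO_PERSON)
    return len(approvers) >= REQUIRED_APPROVERS[tier]
```

Using a set for `approvers` means the same person approving twice still counts once, which is what a two-person rule needs.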

Scalability

Discovery sharding by tenant namespace with Redis distributed token buckets.
Path-explosion defense: k-shortest paths on the attack subgraph, never full enumeration; logical-attack-graph model (polynomial in network size).
Prioritization scoring: CVSS × EPSS × (1 + reachability_bonus) × asset_criticality × centrality / mitigation_score. EPSS daily-refreshed.
Graph scale: Neo4j Community to ~10M nodes; Memgraph/NebulaGraph migration path beyond.
LLM cost control: cheap Verifier model for ~80% of checks; reserve the large reasoner for novel/low-confidence correlations; cache RAG retrievals.
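
The path-explosion defense, k-shortest paths instead of full enumeration, can be illustrated with a best-first search that stops after k complete paths. This is a teaching sketch, not Yen's algorithm and not the Neo4j GDS procedure a deployment would more likely call; the graph shape and edge costs (lower meaning easier for an attacker) are made up.

```python
import heapq


def k_best_paths(graph, source, target, k):
    """Return up to k lowest-cost simple paths as (cost, [nodes]) pairs.

    graph maps node -> {neighbor: non-negative edge cost}.
    """
    found = []
    heap = [(0.0, [source])]
    while heap and len(found) < k:
        # Partial paths come off the heap in non-decreasing cost order,
        # so complete paths are discovered cheapest first.
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for nxt, w in graph.get(node, {}).items():
            if nxt not in path:  # keep paths simple (no revisits)
                heapq.heappush(heap, (cost + w, path + [nxt]))
    return found


# Hypothetical attack subgraph.
lab = {
    "internet": {"proxy": 1.0, "vpn": 4.0},
    "proxy": {"api": 1.0},
    "vpn": {"api": 1.0, "db": 1.0},
    "api": {"db": 2.0},
}
top2 = k_best_paths(lab, "internet", "db", 2)
```

Here `top2` holds the path through the proxy and API (cost 4.0) and the path through the VPN (cost 5.0); the third path, through the VPN and then the API (cost 7.0), is never completed because the search stops at k.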

11. Development Stages (Build Plan)

Build the deterministic path first. The learned matcher (Stage 4) can lag without breaking the product thesis, because the deterministic SLSA/OCI/SBOM joins are expected to exceed competitor accuracy on their own.

Stage 1 — Foundation (months 0–3)

Goal: discover → graph, end to end, safely.

Stand up Discovery Engine (Naabu/Subfinder/Httpx/Katana/Nuclei) inside a LangGraph orchestrator with strict scope enforcement.
Stand up SBOM pipeline (Syft + Grype + Trivy) and Neo4j topology store.
Define the property-graph schema (Asset, Service, Software, CVE, Repository, Commit, Image, Finding).
Stand up Kafka bus, Postgres findings store, MinIO blob store.
Gate: scan a synthetic 1,000-asset environment end-to-end; graph populated; zero production-impact incidents.

Stage 2 — SAST + LLM Remediation Skeleton (months 3–6)

Goal: generate verified patches behind a HITL PR.

Integrate Semgrep (PR-time) + CodeQL (nightly).
Stand up vLLM + coder model; build RAG over Qdrant with repo embeddings.
Build the HITL PR bot (GitHub/GitLab apps).
Implement the Patch Verifier (sandboxed Semgrep+CodeQL+tests before PR open).
Gate: ≥70% of generated patches pass the Verifier sandbox on a curated corpus; ≥40% accepted by developers in pilot.

Stage 3 — Correlation Engine MVP, Deterministic Only (months 6–9)

Goal: ship the moat using cryptographic + declarative signals.

Implement Tier-1 → Tier-6 signals: Sigstore/SLSA verify, OCI label parsing, SBOM-PURL joins, CI metadata, K8s/ArgoCD parsing, Terraform-state parsing.
Implement the confidence-scoring engine (noisy-OR + contradiction penalty).
Wire AUTO_LINK / PROPOSED_LINK / CANDIDATE gating into PR generation.
Gate: ≥95% of AUTO_LINK matches correct on developer review; zero false patches merged.

Stage 4 — Learned Matching + Attack-Path Chaining (months 9–12)

Goal: cover customers lacking provenance; add chaining.

Build the two-stage fingerprint matcher (TF-IDF candidates → BERT pairwise) for Java/JS/Python/Go.
Implement attack-graph chaining (k-shortest path) with EPSS×CVSS×centrality×reachability scoring.
Add the LLM-based SAST false-positive filter.
Gate: customer reports "found a path competitors missed" or "explained a CVE with the exact commit" within 30 days post-deploy.

Stage 5 — Verifier Hardening, Validation & Scale (months 12–18)

Goal: production hardening + on-prem learning loop.

Verifier agent on every LLM-asserted graph mutation; track hallucination metrics.
Sandboxed PoC against customer digital twins (opt-in).
Memgraph/NebulaGraph migration option for >10M-node tenants.
Per-tenant feedback RAG (accept/reject) closes the learning loop.

Replan triggers:

Correlation precision < 90% in Stage 3 → delay Stage 4; invest in declarative signal coverage (enforce OCI labels via admission control).
Patch acceptance < 30% in Stage 2 → deepen the Verifier sandbox before scaling.
Any scan-related production incident → halt ACTIVE_SAFE, revert to PASSIVE until the root-cause fix has shipped.
Graph query latency > 5s at customer scale → subgraph caching or Memgraph migration before adding enrichment.

12. Repository Layout

synosec/
├── README.md
├── deploy/
│   ├── helm/                      # single air-gapped Helm chart
│   └── docker-compose.dev.yml     # local single-box (Ollama, Neo4j, Qdrant, PG)
├── proto/                         # shared gRPC/Protobuf contracts
├── services/
│   ├── discovery-engine/          # Naabu/Subfinder/Httpx/Katana/BloodHound wrappers
│   ├── vuln-intel/                # Nuclei/Grype/Trivy + local NVD/EPSS/KEV mirror
│   ├── sast-pipeline/             # Semgrep/CodeQL/Joern + reachability fusion
│   ├── llm-orchestrator/          # vLLM client, patch gen, RAG, patch verifier
│   ├── correlation-engine/        # signal collectors, fusion scorer, verifiers
│   ├── orchestration/             # LangGraph planner/executor/verifier graphs
│   ├── reporting-ui/              # React/Next.js + GraphQL gateway + PR bot
│   └── common/                    # auth (mTLS), audit ledger, engagement scope
├── data/
│   ├── neo4j-schema/              # constraints, indexes, GDS projections
│   ├── postgres-migrations/
│   └── qdrant-collections/
├── agents/
│   └── planner/ executor/ verifier/   # prompts (JSON-schema), tool allowlists
└── test/
    ├── synthetic-lab/             # 1,000-asset env, 6-VM PoC, digital twins
    └── corpora/                   # patch corpus, OWASP Benchmark harness

13. Local Development & Deployment

Single-box dev (docker-compose.dev.yml): Ollama (coder + reasoner), Neo4j Community, Qdrant, PostgreSQL+pgvector, MinIO, Redpanda. Discovery tools as sidecar containers. Bring up:

docker compose -f deploy/docker-compose.dev.yml up -d
make seed-synthetic-lab        # spins up the 1,000-asset test environment
make run-engagement SCOPE=test/synthetic-lab/scope.json

Production (air-gapped): helm install synosec deploy/helm -f values.airgap.yaml. All model weights, vuln DBs (NVD/GHSA/OSV/EPSS/KEV), and container images are mirrored into the customer registry beforehand. No outbound network required at runtime.

GPU sizing: reasoner (70B quantized) ≈ 1–2× A100/H100; coder (32B) ≈ 1× A100; embeddings on CPU or shared GPU. Verifier uses the smaller/cheaper model.

14. Testing & Acceptance Gates

Unit + contract tests per service against the Protobuf schemas.
Synthetic lab (test/synthetic-lab/) — a controlled 1,000-asset environment with known vulnerabilities and a known repo↔image mapping, used to measure correlation precision/recall.
OWASP Benchmark harness — measures SAST + LLM-filter false-positive reduction.
Patch corpus — measures Verifier pass rate and developer acceptance.
Posture tests — assert that out-of-scope targets are refused, rate ceilings hold, and ACTIVE_SAFE never fires exploitation templates.
Hallucination metric — sampled LLM-as-judge spot-check on graph edges and patches; tracked as a top-line product metric.
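
Because the synthetic lab ships a known repo↔image mapping, correlation precision and recall reduce to a dictionary comparison. A minimal sketch follows, assuming links are keyed by image digest and written as `repo@commit` strings; both conventions are illustrative.

```python
def precision_recall(predicted: dict[str, str], truth: dict[str, str]) -> tuple[float, float]:
    """Compare predicted links against ground truth.

    Both dicts map image digest -> 'repo@commit'. Images the engine
    declined to link are simply absent from predicted.
    """
    correct = sum(1 for img, link in predicted.items() if truth.get(img) == link)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical lab ground truth and engine output.
truth = {
    "sha256:aaa": "octo/spring-app@a1b2c3d",
    "sha256:bbb": "octo/billing@0f0f0f0",
    "sha256:ccc": "octo/gateway@1234567",
    "sha256:ddd": "octo/worker@89abcde",
}
predicted = {
    "sha256:aaa": "octo/spring-app@a1b2c3d",
    "sha256:bbb": "octo/billing@0f0f0f0",
    "sha256:ccc": "octo/worker@89abcde",  # wrong link
}
precision, recall = precision_recall(predicted, truth)
```

Restricting `predicted` to AUTO_LINK matches gives the number that the Stage 3 gate (≥95% correct) is checked against.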

Definition of done per stage = the stage gate in Section 11.

15. Glossary

Term	Meaning
Attack graph	Graph of how chained vulnerabilities let an attacker move from an entry point to a target
AUTO_LINK	Correlation confidence ≥ 0.95 with a cryptographic signal; safe to patch automatically (still HITL-approved)
Correlation Engine	The service that links an external vulnerability to the exact source code
HITL	Human-in-the-loop — required human approval before state-changing actions
Non-exploitative validation	Confirming exploit feasibility without firing live payloads at production
Provenance (SLSA/in-toto)	Cryptographic attestation binding a built artifact to its source commit and builder
ReAct	Reasoning + Acting agent loop (think → call tool → observe → repeat)
Reachability	Whether a vulnerable code symbol is actually callable from a public entry point
Verifier agent	The component that re-checks every LLM-asserted fact deterministically before it enters the graph
Zero-egress	No data leaves the customer perimeter; all inference and storage on-prem

This README is the authoritative internal build specification. Build deterministic before learned, validate before exploit, and never let an unverified LLM assertion become a committed fact.
