
alsaifybashar/synosec


SynoSec — AI-Driven Continuous Attack Surface Validation & Remediation Platform

An end-to-end, self-hosted security platform that discovers external infrastructure, builds a live attack graph, correlates each external vulnerability back to the exact source code that deployed it, and generates verified, human-approved code fixes.

The differentiator is the Correlation Engine: where other tools stop at "patch this CVE on this host," SynoSec closes the loop to UserController.java:42 in commit a1b2c3d, with cryptographic evidence binding the running container to that code.


Table of Contents

  1. What This Tool Does
  2. System Overview & Mental Model
  3. Architecture: The Nine Services
  4. Data Model (Graph + Relational + Vector)
  5. End-to-End Data Flow
  6. The Correlation Engine (Core Differentiator)
  7. The Agentic Orchestration Layer
  8. Technology Stack
  9. Inter-Service Contracts (APIs & Events)
  10. Security, Posture & Safety Controls
  11. Development Stages (Build Plan)
  12. Repository Layout
  13. Local Development & Deployment
  14. Testing & Acceptance Gates
  15. Glossary

1. What This Tool Does

SynoSec operates across two domains and bridges them:

External Domain (Infrastructure & Network Mapping)

  • Autonomous discovery — an AI-guided scanner traverses authorized infrastructure, identifying systems, running services, software, and technologies.
  • Dynamic graph mapping — as nodes are discovered, the system evaluates them for further connections, building a continuous live topology (nodes and edges).
  • Vulnerability detection — identifies CVEs, misconfigurations, and weaknesses for each discovered component.
  • Attack path chaining — analyzes how isolated vulnerabilities chain across connected nodes to reach deeper systems, producing an attack graph.

Internal Domain (Codebase & Remediation)

  • Static code analysis (SAST) — scans the internal codebase for vulnerabilities.
  • LLM-driven remediation — combines SAST findings with LLMs to generate code-level fixes.
  • Human-in-the-loop (HITL) — surfaces fixes via a UI/PR workflow for approval or rejection.

The Bridge (the reason this tool exists)

  • The Correlation Engine links an external attack path to the exact repository, commit, file, and function that produced the vulnerable running service — then triggers a verified patch against that code.

Posture: SynoSec performs continuous, non-exploitative validation. It confirms exploit feasibility (version match + reachability + optional sandboxed proof-of-concept against a digital twin) rather than firing live payloads against production.

Deployment constraint: Zero-egress, fully self-hosted. No prompts, code, scan data, or findings ever leave the customer perimeter. All LLM inference runs on-prem.


2. System Overview & Mental Model

The simplest way to reason about SynoSec:

"Outside-in discovery feeds a graph. The graph plus code feeds a correlation. The correlation feeds a verified patch. A human approves the patch."

DISCOVER  ──►  GRAPH  ──►  CORRELATE  ──►  REMEDIATE  ──►  APPROVE
(external)    (Neo4j)     (Correlation     (LLM patch)    (HITL PR)
                           Engine)

Everything is orchestrated by a LangGraph state machine running a ReAct reasoning agent plus a Verifier agent that gates every fact before it enters the graph. Tools (scanners, SAST engines, SBOM tools) are deterministic; the LLM only plans and explains — it never commits an unverified fact.


3. Architecture: The Nine Services

All services run as independently deployable containers, communicate over an internal gRPC/Protobuf mesh (mTLS), and emit events to a Kafka/Redpanda bus. The entire platform ships as a single Helm chart for air-gapped operation.

                     ┌────────────────────────────────────────────────┐
                     │              Orchestration Plane               │
                     │  ┌────────────────┐  ┌─────────────────────┐   │
                     │  │ LangGraph ReAct│  │  Verifier Agent     │   │
                     │  │   Coordinator  │◀▶│ (hallucination gate)│   │
                     │  └───────┬────────┘  └─────────────────────┘   │
                     └──────────┼─────────────────────────────────────┘
                                │
   ┌────────────────────────────┼────────────────────────────────────┐
   │                            ▼                                    │
┌──┴────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ ┌┴─────────────┐
│ Discovery     │ │ Vulnerability        │ │ SAST Pipeline        │ │ LLM          │
│ Engine        │ │ Intelligence         │ │ (Semgrep + CodeQL)   │ │ Orchestrator │
│               │ │                      │ │                      │ │ + Patch Gen  │
└───────┬───────┘ └──────────┬───────────┘ └──────────┬───────────┘ └──────┬───────┘
        │                    │                        │                    │
        ▼                    ▼                        ▼                    ▼
┌────────────────────────────────────────────────────────────────────────────────┐
│                            CORRELATION ENGINE                                  │
│   Deterministic Joiner │ Fingerprint Matcher │ Confidence Scorer │ Attestation │
│                                                                    Verifier    │
└────────────────────────────────┬───────────────────────────────────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        ▼                        ▼                        ▼
┌───────────────┐      ┌──────────────────┐      ┌───────────────────┐
│  Neo4j        │      │ Qdrant (RAG)     │      │ PostgreSQL +      │
│ (Topology +   │      │ + Code Embeddings│      │ pgvector          │
│  Attack Graph)│      │                  │      │ (Findings, Audit) │
└───────────────┘      └──────────────────┘      └───────────────────┘
        │
        ▼
┌────────────────────────────────────────────┐
│  Reporting + HITL UI (React) + PR Bot      │
└────────────────────────────────────────────┘

3.1 Discovery Engine

  • Responsibility: Continuously enumerate the authorized external attack surface and internal assets.
  • Implements: subfinder → dnsx → naabu → httpx → katana pipeline; BloodHound CE for Active Directory; read-only cloud inventory collectors (CloudQuery/Steampipe).
  • Emits: AssetDiscovered events on discovery.events.
  • Constraint: Every tool invocation is checked against a signed Engagement scope object (CIDR + domain allowlist, rate ceilings, blackout windows, active/passive mode). Out-of-scope calls are refused at the orchestration layer.
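The scope constraint above can be sketched as a small allowlist check that runs before any tool call. This is a minimal illustration, not the project's implementation: `EngagementScope`, `in_scope`, and `authorize` are hypothetical names, and signature verification, rate ceilings, and blackout windows are deliberately left out.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class EngagementScope:
    """Hypothetical, simplified stand-in for the signed Engagement scope object."""
    cidrs: tuple[str, ...] = ()
    domains: tuple[str, ...] = ()
    active_allowed: bool = False  # False means PASSIVE-only


def in_scope(scope: EngagementScope, target: str) -> bool:
    """Return True if target (IP literal or hostname) is inside the allowlist."""
    try:
        ip = ipaddress.ip_address(target)
    except ValueError:
        # Not an IP literal: treat it as a hostname and match on domain suffix.
        host = target.lower().rstrip(".")
        return any(host == d or host.endswith("." + d) for d in scope.domains)
    return any(ip in ipaddress.ip_network(c) for c in scope.cidrs)


def authorize(scope: EngagementScope, target: str, active: bool) -> None:
    """Raise PermissionError instead of letting an out-of-scope call proceed."""
    if not in_scope(scope, target):
        raise PermissionError(f"target {target!r} is outside the engagement scope")
    if active and not scope.active_allowed:
        raise PermissionError("engagement is PASSIVE-only; active probing refused")


scope = EngagementScope(cidrs=("203.0.113.0/24",), domains=("example.com",))
authorize(scope, "api.example.com", active=False)  # allowed, returns None
```

Matching on `"." + domain` rather than a bare suffix keeps a lookalike such as `evilexample.com` out of scope.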

3.2 Vulnerability Intelligence Service

  • Responsibility: Match discovered components to CVEs, KEV, EPSS, and misconfiguration templates.
  • Implements: Nuclei (templates), Grype (CVE matching from SBOM), Trivy (OS/IaC/secrets/K8s breadth). Maintains a local mirror of NVD, GHSA, OSV, the daily EPSS snapshot, and CISA KEV (zero-egress).
  • Key rule: Version-aware pre-filter. Match CVEs against the actual deployed software version before writing to the graph, eliminating the bulk of version-irrelevant CVE noise.
  • Emits: VulnerabilityMatched events.
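A minimal sketch of the version-aware pre-filter, assuming advisories are reduced to an "introduced" and a "fixed" version. `parse_version` and `is_affected` are hypothetical helpers that handle only dotted numeric versions; suffixes such as `-rc1` or `.RELEASE` would need a real version parser. The range in the example is the 5.3.x line from the Spring advisory for CVE-2022-22965 (5.3.0 to 5.3.17 affected, fixed in 5.3.18).

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '5.3.17' into (5, 3, 17)."""
    return tuple(int(part) for part in version.split("."))


def is_affected(deployed: str, introduced: str, fixed: str) -> bool:
    """True when introduced <= deployed < fixed (half-open affected range)."""
    return parse_version(introduced) <= parse_version(deployed) < parse_version(fixed)


# Only write a HAS_CVE edge when the deployed version is inside the range.
for deployed in ("5.3.17", "5.3.18", "5.3.23"):
    print(deployed, is_affected(deployed, introduced="5.3.0", fixed="5.3.18"))
```

Only the first version falls inside the range, so only it would produce a graph edge.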

3.3 SAST Pipeline

  • Responsibility: Scan internal repositories at PR time and nightly.
  • Implements: Semgrep (fast PR feedback) + CodeQL (deep nightly taint analysis) + Joern (high-precision specialty pass). A reachability layer fuses the CodeQL call graph with SVF (C/C++) and SootUp (Java) to drop dead-code findings.
  • Emits: CodeFinding{repo, commit, file, line, rule_id, sink_kind, taint_sources, reachability_status, confidence}.

3.4 LLM Orchestrator + Patch Generator

  • Responsibility: Turn correlated findings into developer-quality patches and generate reports.
  • Implements: Self-hosted vLLM serving a coder model (e.g., Qwen2.5-Coder-32B) and a reasoner model (e.g., Llama 3.3 70B). RAG context from Qdrant.
  • Patch verification (pre-human): Every generated patch is re-run through Semgrep + CodeQL diff query + project unit tests in an ephemeral sandbox before any human sees it. Failed patches are dropped and logged as RAG-negative examples.

3.5 Correlation Engine

The centerpiece — see Section 6.

3.6 Graph Database (Neo4j)

Stores live topology, attack graph, and external↔internal correlations as a property graph. GDS library computes centrality and shortest paths.

3.7 Vector Store (Qdrant)

RAG index for code chunks, framework docs, historical accepted patches, CVE descriptions. Embeddings served on-prem via Text-Embeddings-Inference (bge-large-en-v1.5 or nomic-embed-text).

3.8 Findings Store (PostgreSQL + pgvector)

Authoritative ACID system of record for findings, audit trails, HITL decisions, patch history. pgvector column used for semantic dedup ("is this the same finding as last week?").
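As a rough illustration of semantic dedup, the check below compares a new finding's embedding against stored ones by cosine similarity. It is a pure-Python sketch: the 0.95 threshold and the function names are assumptions, and in the real store the comparison would presumably run in SQL through pgvector's cosine-distance operator (`<=>`) rather than in application code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(new_vec: Sequence[float],
                 existing_vecs: Sequence[Sequence[float]],
                 threshold: float = 0.95) -> bool:
    """Treat the finding as a repeat if any stored vector is close enough."""
    return any(cosine_similarity(new_vec, v) >= threshold for v in existing_vecs)


last_week = [[0.99, 0.05], [0.0, 1.0]]    # toy 2-d vectors, not real embeddings
print(is_duplicate([1.0, 0.0], last_week))   # near-identical to the first vector
```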

3.9 Reporting + HITL UI

React/Next.js frontend over a GraphQL gateway. A GitHub/GitLab/Bitbucket bot opens PRs containing: the attack path, the correlation evidence, the SAST rule, the diff, and sandbox test results — with one-click approve/reject. Rejections flow back into the patch-history RAG index.


4. Data Model (Graph + Relational + Vector)

4.1 Neo4j property graph

Node labels: Asset, Service, Software, CVE, Misconfiguration, Repository, Commit, File, Function, Image, Pipeline, Identity, Finding, AttackPathStep.

Core relationships:

| Edge | From → To | Meaning |
| --- | --- | --- |
| RUNS | Asset → Service | Asset hosts a service |
| LISTENS_ON | Service → Port | Service exposes a port |
| CONNECTS_TO | Asset → Asset | Network reachability |
| HAS_CVE | Software → CVE | Component is vulnerable |
| BUILT_FROM | Image → Commit | The key external↔internal edge |
| DEFINED_IN | Service → Repository | Service's source repo |
| DEPLOYED_BY | Asset → Pipeline | CI/CD that deployed it |
| CHAINS_TO | AttackPathStep → AttackPathStep | Vulnerability chaining |
| FIXED_BY | Finding → Patch | Remediation link |

Enrichment properties on attack-path nodes (used for prioritization): cvss, epss, kev (bool), node_centrality, reachability_from_entry, asset_criticality.

4.2 PostgreSQL (system of record)

findings(id, tenant_id, type, severity, status, asset_ref, repo_ref,
         commit_sha, file_path, line, rule_id, confidence,
         correlation_evidence JSONB, created_at, dedup_vector vector(1024))

hitl_decisions(id, finding_id, decision, reason, actor, decided_at)

patch_history(id, finding_id, diff, verifier_result JSONB, accepted BOOL,
              feedback_text, created_at)

audit_ledger(id, prev_hash, payload JSONB, hash, created_at)  -- append-only, hash-chained

engagements(id, tenant_id, scope JSONB, rate_limits JSONB,
            blackout_windows JSONB, mode, signature, signed_by)
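The `audit_ledger` table is described as append-only and hash-chained. One way such a chain could work is sketched below, with an in-memory list standing in for the table. The genesis value, the canonical JSON encoding, and SHA-256 are all assumptions; the schema only fixes the `prev_hash`, `payload`, and `hash` columns.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed prev_hash for the first row


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list, payload: dict) -> dict:
    """Append a row whose hash commits to every row before it."""
    prev = ledger[-1]["hash"] if ledger else GENESIS_HASH
    entry = {"prev_hash": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    ledger.append(entry)
    return entry


def verify_chain(ledger: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered row breaks the chain."""
    prev = GENESIS_HASH
    for entry in ledger:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


ledger: list = []
append_entry(ledger, {"event": "tool_call", "tool": "httpx"})
append_entry(ledger, {"event": "hitl_decision", "decision": "approve"})
print(verify_chain(ledger))
```

Because each hash covers the previous one, editing an old row invalidates every later row unless all of them are rewritten.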

4.3 Qdrant collections

code_embeddings, patch_history, cve_descriptions, framework_docs — each with payload filters (tenant_id, repo, language) for hybrid filter+vector search.


5. End-to-End Data Flow

Engagement (signed scope)
    │
    ▼
LangGraph Coordinator ── ReAct loop ──► Tool nodes (Nmap/Nuclei/Semgrep/...)
    │                        │                      │
    │                        ▼                      ▼
    │                  Verifier node ◀──────── Tool output
    │                        │
    │                        ▼ (only verified facts)
    │                     Neo4j upserts
    │
    ▼
Correlation Engine fuses signals ──► Confidence-scored matches ──► PostgreSQL findings
    │
    ▼
For each AUTO_LINK finding:
    LLM Orchestrator ──► RAG retrieve (Qdrant) ──► Patch draft
                              │
                              ▼
                       Patch Verifier (Semgrep+CodeQL+tests, sandbox)
                              │
                              ▼
                       HITL PR Bot ──► developer accept/reject
                              │
                              ▼
                       Outcome feedback ──► Qdrant patch_history (RAG)

Flow, narrated:

  1. Discovery → Graph. Discovery Engine emits AssetDiscovered. The Verifier confirms each fact (re-grab banner, re-resolve DNS) before the Coordinator upserts nodes into Neo4j. Unverified LLM-asserted edges are quarantined, never committed.
  2. Vuln Intel → Graph. CVEs/misconfigs matched (version-filtered) and written as HAS_CVE edges with cvss/epss/kev.
  3. Chaining → Attack Graph. k-shortest-path computation from declared entry points (internet, partner, insider) to crown-jewel assets, scored by CVSS × EPSS × (1 + reachability_bonus) / mitigation_score. Never enumerate all paths (path-explosion defense).
  4. Correlation. For each top-ranked vulnerable service, the Correlation Engine fuses signals to find the exact repo/commit/file (see §6).
  5. Remediation. For AUTO_LINK findings, the LLM Orchestrator generates a patch with RAG context; the Patch Verifier re-checks it in a sandbox.
  6. Approval. The PR bot surfaces evidence + diff; the human approves or rejects; the outcome feeds back into the RAG index (the only "learning" loop in a zero-egress deployment).

6. The Correlation Engine (Core Differentiator)

The Correlation Engine answers one question with high accuracy and low false-positive rate:

"This vulnerable running service — which exact line of source code, in which commit, produced it?"

It is a confidence-scored, multi-signal join, not a single ML model. Signals are layered from deterministic (cryptographic) down to probabilistic (learned).

6.1 The Signal Stack

| Tier | Signal | Trust | How obtained |
| --- | --- | --- | --- |
| 1. Cryptographic | SLSA in-toto provenance verified via Sigstore/Cosign | 1.00 | cosign verify-attestation; slsa-verifier verify-image |
| 2. Declarative | OCI labels: org.opencontainers.image.source, .revision | 0.85 | docker buildx imagetools inspect --raw |
| 3. Declarative | SBOM (Syft) with package PURLs and source hints | 0.80 | syft <image> -o cyclonedx-json |
| 4. Pipeline | CI job ID / run URL stored as image label | 0.75 | Parsed from build-time labels |
| 5. Deployment | K8s annotations, ArgoCD app source, Helm values | 0.70 | K8s API + ArgoCD CRD watch |
| 6. IaC | Terraform state → module → resource → image tag | 0.65 | Terraform state parsing |
| 7. Network | Banner, JARM, HTTP tech fingerprint, favicon hash | 0.40–0.65 | httpx + Nuclei tech-detect |
| 8. Learned | Source structure ↔ runtime endpoint (TF-IDF candidates → BERT pairwise classifier) | 0.30–0.85 | Two-stage retrieval + classification |
| 9. AST/symbol | "Hot-spot" unique class/route names matched to deployed bytecode/JS | 0.50–0.80 | tree-sitter AST extraction, JS de-bundling, JVM class extraction |

6.2 Confidence Scoring Algorithm

Each signal s_i yields (weight_i, score_i ∈ [0,1]). Fuse with noisy-OR + contradiction penalty:

def fuse(signals):
    p = 1.0
    for w, s in signals:
        p *= (1 - w * s)
    p_match = 1 - p
    return p_match * contradiction_penalty(signals)

# contradiction_penalty drops to 0.5 if two signals point at different
# repos beyond tolerance → emit a "provenance conflict" finding,
# NEVER silently pick one.
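The snippet above leaves `contradiction_penalty` undefined. A self-contained variant might look like the following. The `Signal` type is hypothetical, each signal's weight is assumed to be its tier trust from the signal stack, and "beyond tolerance" is simplified to "any two matching signals name different repos".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    weight: float  # assumed to be the tier trust value
    score: float   # match score in [0, 1]
    repo: str      # repository this signal points at


def contradiction_penalty(signals: list[Signal]) -> float:
    """0.5 when matching signals disagree on the repo, else 1.0 (zero tolerance)."""
    repos = {s.repo for s in signals if s.score > 0}
    return 0.5 if len(repos) > 1 else 1.0


def fuse(signals: list[Signal]) -> float:
    """Noisy-OR over weight * score, scaled by the contradiction penalty."""
    p_no_match = 1.0
    for s in signals:
        p_no_match *= 1.0 - s.weight * s.score
    return (1.0 - p_no_match) * contradiction_penalty(signals)


agree = [Signal(0.85, 1.0, "github.com/octo/spring-app"),
         Signal(0.80, 1.0, "github.com/octo/spring-app")]
# "other-app" is an invented repo used only to trigger the conflict path.
clash = [Signal(0.85, 1.0, "github.com/octo/spring-app"),
         Signal(0.80, 1.0, "github.com/octo/other-app")]
print(round(fuse(agree), 3), round(fuse(clash), 3))  # 0.97 0.485
```

With two agreeing declarative signals the noisy-OR gives 1 − (0.15 × 0.20) = 0.97; the same pair pointing at different repos is halved to 0.485, and a caller would also raise the provenance-conflict finding at that point.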

Decision thresholds:

| Band | Condition | Action |
| --- | --- | --- |
| AUTO_LINK | adjusted ≥ 0.95 and ≥1 Tier-1 signal | Used directly in patch generation |
| PROPOSED_LINK | 0.70 ≤ adjusted < 0.95 | One-click human confirm |
| CANDIDATE | adjusted < 0.70 | Graph annotation only; never patched |

Hard rule: The LLM patch generator never operates on a correlation below AUTO_LINK without explicit HITL approval. This single gate prevents most "patched the wrong file" errors.
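The thresholds and the hard rule can be expressed as two small functions. This is a sketch with hypothetical names. One case is not covered by the table: a score of 0.95 or higher with no Tier-1 signal. The code assumes it falls back to PROPOSED_LINK.

```python
from enum import Enum


class Band(Enum):
    AUTO_LINK = "AUTO_LINK"
    PROPOSED_LINK = "PROPOSED_LINK"
    CANDIDATE = "CANDIDATE"


def classify(adjusted: float, has_tier1: bool) -> Band:
    """Map an adjusted confidence to a decision band."""
    if adjusted >= 0.95 and has_tier1:
        return Band.AUTO_LINK
    if adjusted >= 0.70:
        # Assumption: high scores without cryptographic evidence need a human.
        return Band.PROPOSED_LINK
    return Band.CANDIDATE


def may_generate_patch(band: Band, hitl_confirmed: bool) -> bool:
    """Hard rule: below AUTO_LINK, patching requires explicit human confirmation."""
    if band is Band.AUTO_LINK:
        return True
    if band is Band.PROPOSED_LINK:
        return hitl_confirmed
    return False  # CANDIDATE links are graph annotations only


print(classify(1.0, has_tier1=True).value)    # AUTO_LINK
print(classify(0.97, has_tier1=False).value)  # PROPOSED_LINK
```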

6.3 Worked Example: Attack Path → Exact Code

  1. Discovery: api.example.com:443 runs Spring Boot 2.6.5. Nuclei flags CVE-2022-22965 (Spring4Shell, CVSS 9.8, EPSS 0.97, KEV).
  2. Chaining: Verifier confirms the reverse proxy actually proxies the vulnerable endpoint (non-exploitative HEAD/OPTIONS probe). Centrality + EPSS×CVSS re-rank this path to the top.
  3. Correlation:
    • K8s API → pod image digest sha256:abc….
    • cosign verify-attestation → SLSA provenance: source github.com/octo/spring-app, commit a1b2c3d. Tier-1, score 1.00.
    • docker buildx imagetools inspect → org.opencontainers.image.revision=a1b2c3d. Tier-2 corroborates.
    • Syft SBOM → spring-webmvc@5.3.17 (inside the affected 5.3.0–5.3.17 range, so the version-aware pre-filter keeps the CVE). Tier-3 corroborates.
    • Fusion → 1.0 → AUTO_LINK.
  4. Code anchoring: Correlation Engine asks SAST: "in octo/spring-app@a1b2c3d, find public Spring controllers using the vulnerable DataBinder pattern." CodeQL returns src/main/java/com/octo/web/UserController.java:42; reachability confirms it's reachable from DispatcherServlet.
  5. Patch: Coder LLM gets the snippet + CodeQL alert + Spring advisory (RAG) + 3 historical accepted fixes. Produces a diff (@InitBinder with setDisallowedFields(...) or dependency upgrade). Verifier re-runs Semgrep+CodeQL+tests in sandbox — all pass.
  6. PR: Bot opens "Fix CVE-2022-22965 in UserController (reachable from internet via api.example.com)" with attack-path SVG, 4-signal correlation evidence (confidence 1.0), diff, test results, approve/reject.

Result: a perimeter Spring4Shell CVE resolved with a verified patch against an exact line, with cryptographic evidence binding the container to that code.

6.4 Anti-False-Attribution Verification Layers

  1. Provenance conflict detector — disagreeing Tier-1/2 signals → surface as a supply-chain finding, don't auto-link.
  2. Sigstore Rekor cross-check — verify inclusion proof offline against a pinned trusted root; fail closed if missing.
  3. Signed-SBOM requirement — trust an SBOM only if attached as a signed in-toto attestation.
  4. Reachability gate — no patch unless the vulnerable symbol is reachable from a public entry point.
  5. Verifier agent on every LLM-asserted edge — re-validate with a deterministic tool call before committing to Neo4j.

7. The Agentic Orchestration Layer

Framework: LangGraph (state-machine + durable execution + checkpointing).

Design rules (security-critical):

  • Planner/Executor split. The planner LLM has no tool access. The executor LLM has tools but is constrained to JSON I/O and an enforced allowlist. (Agent isolation is the single strongest prompt-injection defense.)
  • JSON-structured tool I/O. All external content (banners, scraped HTML, CVE text) is wrapped in delimiters and never concatenated into the system-prompt zone.
  • Bounded budget. A cap on tool calls and wall-clock time per goal (e.g., 50 calls / 20 min) prevents runaway agents.
  • Tool allowlist at the graph level. The LLM cannot invoke an out-of-scope tool regardless of what it generates.
  • Checkpoint every transition to Postgres so interrupted scans resume deterministically.
  • No silent state-mutating tool calls. Any tool that opens a PR, writes a file, or changes a firewall rule requires HITL approval.
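Two of the rules above, the tool allowlist and the bounded budget, can be enforced outside the model in ordinary code. The sketch below is plain Python rather than LangGraph API: `GuardedExecutor` and both exception types are hypothetical, and the defaults mirror the 50-call / 20-minute example.

```python
import time
from typing import Any, Callable, Mapping


class ToolNotAllowed(PermissionError):
    """Raised when the model asks for a tool outside the allowlist."""


class BudgetExceeded(RuntimeError):
    """Raised when the per-goal call or wall-clock budget is spent."""


class GuardedExecutor:
    def __init__(self, tools: Mapping[str, Callable[..., Any]],
                 max_calls: int = 50, max_seconds: float = 20 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tools = dict(tools)          # the allowlist is fixed at construction
        self._max_calls = max_calls
        self._clock = clock
        self._deadline = clock() + max_seconds
        self.calls = 0

    def call(self, name: str, **kwargs: Any) -> Any:
        """Run an allowlisted tool, or refuse regardless of what the LLM generated."""
        if name not in self._tools:
            raise ToolNotAllowed(name)
        if self.calls >= self._max_calls or self._clock() > self._deadline:
            raise BudgetExceeded(f"budget spent after {self.calls} calls")
        self.calls += 1
        return self._tools[name](**kwargs)


# The lambda is a stand-in tool; a real one would wrap a scanner invocation.
executor = GuardedExecutor({"resolve": lambda host: f"resolved {host}"}, max_calls=2)
print(executor.call("resolve", host="api.example.com"))
```

Keeping the check in the executor rather than in the prompt means a prompt-injected tool name fails closed.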

The Verifier agent (separate, smaller/cheaper model or deterministic rules) re-checks every claim before it becomes a graph fact:

  • Numeric/factual claim → must invoke a tool to confirm (re-probe banner, re-resolve DNS, re-query inventory).
  • Low-confidence edges → quarantine queue, never the live graph.
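The Verifier's gate can be pictured as a function that routes each claim either to the graph or to the quarantine queue. In this sketch `gate_claims` and `confirm_dns` are hypothetical, and a dictionary stands in for a real DNS re-resolution so the example stays offline. A confirmation that raises is treated as a failure, so the gate fails closed.

```python
from typing import Callable, Iterable


def gate_claims(claims: Iterable[dict],
                confirm: Callable[[dict], bool]) -> tuple[list[dict], list[dict]]:
    """Split claims into (verified, quarantined) using a deterministic re-check."""
    verified: list[dict] = []
    quarantined: list[dict] = []
    for claim in claims:
        try:
            ok = bool(confirm(claim))
        except Exception:
            ok = False  # fail closed: an unverifiable claim never reaches the graph
        (verified if ok else quarantined).append(claim)
    return verified, quarantined


# Stand-in for a re-resolve tool call; the addresses are documentation examples.
dns_table = {"api.example.com": "203.0.113.10"}


def confirm_dns(claim: dict) -> bool:
    return dns_table.get(claim["host"]) == claim["ip"]


claims = [
    {"host": "api.example.com", "ip": "203.0.113.10"},
    {"host": "db.example.com", "ip": "203.0.113.99"},  # LLM-asserted, unconfirmed
]
verified, quarantined = gate_claims(claims, confirm_dns)
print(len(verified), len(quarantined))  # 1 1
```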

8. Technology Stack

Discovery & Vulnerability Scanning

| Need | Choice | Notes |
| --- | --- | --- |
| Port scanning | Naabu (+ Nmap second pass for OS fingerprint) | Default rate capped low (e.g., 200 pps) for safety |
| Subdomain enum | Subfinder (+ periodic Amass) | Passive-first |
| HTTP fingerprint | Httpx | Tech detection |
| Web crawling | Katana | Security-recon crawler |
| Template vuln scan | Nuclei (+ OWASP ZAP for authenticated DAST) | 9,000+ community templates |
| AD/identity | BloodHound CE (Neo4j-backed) | AD attack paths |
| Cloud inventory | CloudQuery + Steampipe (read-only IAM) | Normalizes to Postgres |

SAST & Reachability

| Need | Choice | Notes |
| --- | --- | --- |
| PR-time SAST | Semgrep | Fast; filter aggressively |
| Deep nightly SAST | CodeQL | Deepest open-source data-flow |
| High-precision pass | Joern | Low FPR specialty |
| Reachability | CodeQL + SVF (C/C++) + SootUp (Java) | Multi-tool fusion |
| FP filter (Stage 4) | LLM agent over SAST output | Intended to cut SAST false positives; reduction measured with the OWASP Benchmark harness (Section 14) |

SBOM & Container Provenance

| Need | Choice |
| --- | --- |
| SBOM generation | Syft (+ Trivy for breadth) |
| Vuln matching | Grype + Trivy (different DB coverage) |
| Image signing/attestation | Cosign + Sigstore (offline verify for air-gap) |
| SLSA verify | slsa-verifier |
| Continuous SBOM monitoring | Dependency-Track |

AI / Agents / Data

| Need | Choice | Notes |
| --- | --- | --- |
| Orchestration | LangGraph | State machine + checkpointing |
| LLM serving | vLLM (prod), Ollama (dev/single-box) | Air-gap friendly |
| Reasoner LLM | Llama 3.3 70B or DeepSeek-V3 (quantized) | Open-weight |
| Coder LLM | Qwen2.5-Coder-32B or DeepSeek-Coder-V2 | Open-weight |
| Embeddings | bge-large-en-v1.5 / nomic-embed-text | Self-hosted via TEI |
| Vector DB | Qdrant | Single binary, payload filters |
| Graph DB | Neo4j Community (Memgraph/NebulaGraph swap >10M nodes) | GDS library |
| Findings DB | PostgreSQL + pgvector | ACID + dedup |
| Blob store | MinIO (S3-compatible) | Immutable scan output, SBOMs |
| Bus | Kafka / Redpanda | Event backbone |
| Metrics | Prometheus + Grafana | |

Attack-Graph References (study/optional, not vendored)

MulVAL (logical attack graph, polynomial), CAULDRON/TVA (exploit-dependency model), MITRE CALDERA (ATT&CK emulation plug-in), PentAGI / CAI / Strix (open-source agent references — none of which correlate to source code, confirming the gap SynoSec fills).


9. Inter-Service Contracts (APIs & Events)

All services expose gRPC (Protobuf) and publish/consume Kafka events. Representative schemas:

Event: AssetDiscovered (topic discovery.events)

{
  "asset_id": "uuid", "tenant_id": "uuid",
  "ip": "203.0.113.10", "port": 443, "proto": "https",
  "service": "spring-boot", "version": "2.6.5",
  "banner": "...", "tls_jarm": "...", "fingerprint": {"...": "..."},
  "hashes": {"favicon": "..."}, "source_tool": "httpx",
  "timestamp": "2026-06-08T12:00:00Z", "confidence": 0.92
}

Event: VulnerabilityMatched (topic vuln.events)

{
  "asset_id": "uuid", "cve_id": "CVE-2022-22965",
  "cvss": 9.8, "epss": 0.97, "kev": true,
  "version_match": true, "source_tool": "nuclei"
}

Event: CodeFinding (topic sast.events)

{
  "repo": "github.com/octo/spring-app", "commit": "a1b2c3d",
  "file": "src/main/java/com/octo/web/UserController.java", "line": 42,
  "rule_id": "java/spring-disallowed-fields", "sink_kind": "data-binder",
  "taint_sources": ["request.param"], "reachability_status": "reachable",
  "confidence": 0.91
}

gRPC: CorrelationService.Correlate

rpc Correlate(CorrelateRequest) returns (CorrelateResponse);

message CorrelateRequest { string asset_id = 1; string image_digest = 2; }
message CorrelateResponse {
  string repo = 1; string commit = 2;
  float confidence = 3;
  Band band = 4;                    // AUTO_LINK | PROPOSED_LINK | CANDIDATE
  repeated Signal signals = 5;      // evidence trail
  bool provenance_conflict = 6;
}

gRPC: RemediationService.GeneratePatch → returns unified diff + justification + self-grade + verifier result.


10. Security, Posture & Safety Controls

Non-disruptive scanner operation

  • Signed Engagement object declares scope, per-tool rate ceilings, blackout windows, and active/passive mode.
  • Two modes: PASSIVE (banner/DNS/cert-transparency/SBOM only) and ACTIVE_SAFE (port scan with adaptive backoff, Nuclei filtered to safe=true, no exploitation templates).
  • Rate limiting: per-target token bucket; global circuit breaker halves rate if 5xx exceeds threshold.
  • Validation, not exploitation: version match + reachability + optional sandboxed PoC against a customer-provided digital twin — never against production.
  • Change-window/freeze flags honored from the customer CMDB.
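The rate-limiting bullet describes a per-target token bucket whose rate is halved by a circuit breaker. A single-process sketch follows; the class and method names are hypothetical, and the 5xx threshold detection that would trigger `halve_rate` is not shown. The injectable clock only makes the behaviour easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._tokens = capacity          # start full
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last check, up to capacity."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; a False result means the probe must wait."""
        self._refill()
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False

    def halve_rate(self) -> None:
        """Circuit-breaker hook: settle tokens at the old rate, then slow down."""
        self._refill()
        self.rate /= 2


bucket = TokenBucket(rate_per_sec=200.0, capacity=200.0)  # e.g. a 200 pps ceiling
if bucket.try_acquire():
    pass  # send one probe here
```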

LLM security

  • Zero-egress: all inference on-prem; nothing leaves the perimeter.
  • Prompt-injection defense-in-depth: planner/executor isolation, JSON-structured I/O, an input prompt-injection classifier, an outbound egress allowlist on agent containers, and no silent state-mutating tool calls.
  • Sandboxing: every tool runs in an ephemeral container — no host mounts, no cloud credentials, gVisor/Kata kernel isolation, egress restricted to engagement scope.
  • Audit trail: every prompt, tool call, output, graph mutation, and HITL decision hash-chained into an append-only Postgres ledger.
  • Data retention: per-tenant AES-GCM at rest (Vault/KMS). Defaults: scan data 90 days, findings indefinite, LLM I/O 30 days (configurable to ephemeral).

HITL control tiers

| Tier | Scope | Behavior |
| --- | --- | --- |
| 0 | Read-only PASSIVE scans, allowlisted assets | Auto-execute |
| 1 | ACTIVE_SAFE scans, SAST, draft patch generation | Auto-execute + audit |
| 2 | PR creation, sub-AUTO_LINK edge commits, sandboxed PoC | Approve-before-execute |
| 3 | Scope changes, tool-allowlist changes, model upgrades | Two-person approval |

Scalability

  • Discovery sharding by tenant namespace with Redis distributed token buckets.
  • Path-explosion defense: k-shortest paths on the attack subgraph, never full enumeration; logical-attack-graph model (polynomial in network size).
  • Prioritization scoring: CVSS × EPSS × (1 + reachability_bonus) × asset_criticality × centrality / mitigation_score. EPSS daily-refreshed.
  • Graph scale: Neo4j Community to ~10M nodes; Memgraph/NebulaGraph migration path beyond.
  • LLM cost control: cheap Verifier model for ~80% of checks; reserve the large reasoner for novel/low-confidence correlations; cache RAG retrievals.
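The prioritization formula above translates directly into code. The function name and the sample inputs are illustrative only; the document does not state how `reachability_bonus`, `centrality`, or `mitigation_score` are scaled, so the values below are placeholders.

```python
def priority_score(cvss: float, epss: float, reachability_bonus: float,
                   asset_criticality: float, centrality: float,
                   mitigation_score: float) -> float:
    """CVSS × EPSS × (1 + reachability_bonus) × asset_criticality × centrality / mitigation_score."""
    if mitigation_score <= 0:
        raise ValueError("mitigation_score must be positive")
    return (cvss * epss * (1.0 + reachability_bonus)
            * asset_criticality * centrality / mitigation_score)


# Placeholder inputs: the same CVE on an exposed asset and on a well-mitigated one.
exposed = priority_score(9.8, 0.97, reachability_bonus=0.5,
                         asset_criticality=1.0, centrality=1.0, mitigation_score=1.0)
mitigated = priority_score(9.8, 0.97, reachability_bonus=0.0,
                           asset_criticality=1.0, centrality=1.0, mitigation_score=4.0)
print(exposed > mitigated)  # True
```

Dividing by `mitigation_score` means compensating controls lower a path's rank without removing it from the graph.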

11. Development Stages (Build Plan)

Build the deterministic path first. The learned matcher (Stage 4) can lag without breaking the product thesis: the deterministic SLSA/OCI/SBOM joins already provide the source-code correlation that the open-source agent references in Section 8 lack.

Stage 1 — Foundation (months 0–3)

Goal: discover → graph, end to end, safely.

  • Stand up Discovery Engine (Naabu/Subfinder/Httpx/Katana/Nuclei) inside a LangGraph orchestrator with strict scope enforcement.
  • Stand up SBOM pipeline (Syft + Grype + Trivy) and Neo4j topology store.
  • Define the property-graph schema (Asset, Service, Software, CVE, Repository, Commit, Image, Finding).
  • Stand up Kafka bus, Postgres findings store, MinIO blob store.
  • Gate: scan a synthetic 1,000-asset environment end-to-end; graph populated; zero production-impact incidents.

Stage 2 — SAST + LLM Remediation Skeleton (months 3–6)

Goal: generate verified patches behind a HITL PR.

  • Integrate Semgrep (PR-time) + CodeQL (nightly).
  • Stand up vLLM + coder model; build RAG over Qdrant with repo embeddings.
  • Build the HITL PR bot (GitHub/GitLab apps).
  • Implement the Patch Verifier (sandboxed Semgrep+CodeQL+tests before PR open).
  • Gate: ≥70% of generated patches pass the Verifier sandbox on a curated corpus; ≥40% accepted by developers in pilot.

Stage 3 — Correlation Engine MVP, Deterministic Only (months 6–9)

Goal: ship the moat using cryptographic + declarative signals.

  • Implement Tier-1 → Tier-6 signals: Sigstore/SLSA verify, OCI label parsing, SBOM-PURL joins, CI metadata, K8s/ArgoCD parsing, Terraform-state parsing.
  • Implement the confidence-scoring engine (noisy-OR + contradiction penalty).
  • Wire AUTO_LINK / PROPOSED_LINK / CANDIDATE gating into PR generation.
  • Gate: ≥95% of AUTO_LINK matches correct on developer review; zero false patches merged.

Stage 4 — Learned Matching + Attack-Path Chaining (months 9–12)

Goal: cover customers lacking provenance; add chaining.

  • Build the two-stage fingerprint matcher (TF-IDF candidates → BERT pairwise) for Java/JS/Python/Go.
  • Implement attack-graph chaining (k-shortest path) with EPSS×CVSS×centrality×reachability scoring.
  • Add the LLM-based SAST false-positive filter.
  • Gate: customer reports "found a path competitors missed" or "explained a CVE with the exact commit" within 30 days post-deploy.

Stage 5 — Verifier Hardening, Validation & Scale (months 12–18)

Goal: production hardening + on-prem learning loop.

  • Verifier agent on every LLM-asserted graph mutation; track hallucination metrics.
  • Sandboxed PoC against customer digital twins (opt-in).
  • Memgraph/NebulaGraph migration option for >10M-node tenants.
  • Per-tenant feedback RAG (accept/reject) closes the learning loop.

Replan triggers:

  • Correlation precision < 90% in Stage 3 → delay Stage 4; invest in declarative signal coverage (enforce OCI labels via admission control).
  • Patch acceptance < 30% in Stage 2 → deepen the Verifier sandbox before scaling.
  • Any scan-related production incident → halt ACTIVE_SAFE, revert to PASSIVE until root cause shipped.
  • Graph query latency > 5s at customer scale → subgraph caching or Memgraph migration before adding enrichment.

12. Repository Layout

synosec/
├── README.md
├── deploy/
│   ├── helm/                      # single air-gapped Helm chart
│   └── docker-compose.dev.yml     # local single-box (Ollama, Neo4j, Qdrant, PG)
├── proto/                         # shared gRPC/Protobuf contracts
├── services/
│   ├── discovery-engine/          # Naabu/Subfinder/Httpx/Katana/BloodHound wrappers
│   ├── vuln-intel/                # Nuclei/Grype/Trivy + local NVD/EPSS/KEV mirror
│   ├── sast-pipeline/             # Semgrep/CodeQL/Joern + reachability fusion
│   ├── llm-orchestrator/          # vLLM client, patch gen, RAG, patch verifier
│   ├── correlation-engine/        # signal collectors, fusion scorer, verifiers
│   ├── orchestration/             # LangGraph planner/executor/verifier graphs
│   ├── reporting-ui/              # React/Next.js + GraphQL gateway + PR bot
│   └── common/                    # auth (mTLS), audit ledger, engagement scope
├── data/
│   ├── neo4j-schema/              # constraints, indexes, GDS projections
│   ├── postgres-migrations/
│   └── qdrant-collections/
├── agents/
│   └── planner/ executor/ verifier/   # prompts (JSON-schema), tool allowlists
└── test/
    ├── synthetic-lab/             # 1,000-asset env, 6-VM PoC, digital twins
    └── corpora/                   # patch corpus, OWASP Benchmark harness

13. Local Development & Deployment

Single-box dev (docker-compose.dev.yml): Ollama (coder + reasoner), Neo4j Community, Qdrant, PostgreSQL+pgvector, MinIO, Redpanda. Discovery tools as sidecar containers. Bring up:

docker compose -f deploy/docker-compose.dev.yml up -d
make seed-synthetic-lab        # spins up the 1,000-asset test environment
make run-engagement SCOPE=test/synthetic-lab/scope.json

Production (air-gapped): helm install synosec deploy/helm -f values.airgap.yaml. All model weights, vuln DBs (NVD/GHSA/OSV/EPSS/KEV), and container images are mirrored into the customer registry beforehand. No outbound network required at runtime.

GPU sizing: reasoner (70B quantized) ≈ 1–2× A100/H100; coder (32B) ≈ 1× A100; embeddings on CPU or shared GPU. Verifier uses the smaller/cheaper model.


14. Testing & Acceptance Gates

  • Unit + contract tests per service against the Protobuf schemas.
  • Synthetic lab (test/synthetic-lab/) — a controlled 1,000-asset environment with known vulnerabilities and a known repo↔image mapping, used to measure correlation precision/recall.
  • OWASP Benchmark harness — measures SAST + LLM-filter false-positive reduction.
  • Patch corpus — measures Verifier pass rate and developer acceptance.
  • Posture tests — assert that out-of-scope targets are refused, rate ceilings hold, and ACTIVE_SAFE never fires exploitation templates.
  • Hallucination metric — sampled LLM-as-judge spot-check on graph edges and patches; tracked as a top-line product metric.

Definition of done per stage = the stage gate in Section 11.


15. Glossary

| Term | Meaning |
| --- | --- |
| Attack graph | Graph of how chained vulnerabilities let an attacker move from an entry point to a target |
| AUTO_LINK | Correlation confidence ≥ 0.95 with a cryptographic (Tier-1) signal; eligible for automatic patch generation, though the resulting PR still requires HITL approval |
| Correlation Engine | The service that links an external vulnerability to the exact source code |
| HITL | Human-in-the-loop: required human approval before state-changing actions |
| Non-exploitative validation | Confirming exploit feasibility without firing live payloads at production |
| Provenance (SLSA/in-toto) | Cryptographic attestation binding a built artifact to its source commit and builder |
| ReAct | Reasoning + Acting agent loop (think → call tool → observe → repeat) |
| Reachability | Whether a vulnerable code symbol is actually callable from a public entry point |
| Verifier agent | The component that re-checks every LLM-asserted fact deterministically before it enters the graph |
| Zero-egress | No data leaves the customer perimeter; all inference and storage on-prem |

This README is the authoritative internal build specification. Build deterministic before learned, validate before exploit, and never let an unverified LLM assertion become a committed fact.
