Zink0909/Statistics-Knowledge-Graph

Statistical Learning Knowledge Graph

An interactive knowledge graph covering statistical learning concepts, methods, and models — and the relationships between them.

307 concepts · 497 relationships · 6 domains · 8 relationship types


What this is

This project maps the conceptual structure of statistical learning as a directed graph. Every concept (OLS, MLE, confidence intervals, logistic regression...) is a node. Every meaningful relationship between concepts (A requires B, A assumes B, A produces B...) is a typed, directed edge.

The result is a navigable knowledge structure you can explore by concept, by domain, or by relationship type — and update every quarter as new material is covered.
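
To make the "typed, directed edge" idea concrete, here is a minimal Python sketch. It is not the project's actual data model: the `Edge` tuple, the `outgoing` helper, and the node ids are hypothetical, and the two sample edges are taken from the requires/assumes examples given later in this README.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    """One typed, directed relationship: source -> target."""
    source: str
    target: str
    edge_type: str


# Illustrative edges only; the ids are hypothetical.
EDGES = [
    Edge("mle", "likelihood", "requires"),
    Edge("ols", "homoscedasticity", "assumes"),
]


def outgoing(edges, node_id):
    """Group the targets of node_id's outgoing edges by edge type."""
    grouped = defaultdict(list)
    for edge in edges:
        if edge.source == node_id:
            grouped[edge.edge_type].append(edge.target)
    return dict(grouped)
```

Calling `outgoing(EDGES, "mle")` groups that node's neighbours by relationship type, which is roughly the view a per-concept page needs.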


Repository structure

/
├── kg-project/              ← data layer (source of truth)
│   ├── domains/             ← one .txt file per domain, lists all nodes
│   │   ├── probability_theory.txt
│   │   ├── probability_distributions.txt
│   │   ├── statistical_inference.txt
│   │   ├── regression_and_linear_models.txt
│   │   ├── generalized_linear_models.txt
│   │   └── model_evaluation_and_selection.txt
│   ├── edges/               ← one .json file per relationship type
│   │   ├── instance_of.json
│   │   ├── requires.json
│   │   ├── assumes.json
│   │   ├── uses_distribution.json
│   │   ├── measures.json
│   │   ├── produces.json
│   │   ├── corresponds_to.json
│   │   └── implemented_by.json
│   ├── output/              ← auto-generated, do not edit manually
│   │   ├── statistical_learning_kg.json
│   │   └── kg_visualization.html
│   ├── build.py             ← validates and rebuilds the graph
│   ├── update_site.py       ← syncs data into the React app
│   └── README.md            ← data layer docs
│
└── slkg/                    ← presentation layer (React app)
    ├── src/
    │   ├── data/graph.ts    ← auto-generated by update_site.py
    │   ├── pages/           ← Graph, Explore, Concept, Domains, Path, Compare, Edges
    │   ├── components/      ← KGCanvas, NodePanel, Layout, UI components
    │   ├── lib/             ← graphUtils.ts, constants.ts
    │   └── types/           ← TypeScript type definitions
    ├── package.json
    └── vite.config.ts

kg-project owns the data. slkg owns the presentation. update_site.py is the only bridge between them.


Domains

| Domain | Concepts |
|---|---|
| Probability Theory | 34 |
| Probability Distributions | 22 |
| Statistical Inference | 107 |
| Regression and Linear Models | 92 |
| Generalized Linear Models | 16 |
| Model Evaluation and Selection | 36 |

These six domains are designed to cover the full scope of statistical learning. New concepts are always added into an existing domain — no new domains should be needed.


Relationship types

| Type | Direction | Meaning |
|---|---|---|
| instance_of | A → B | A is a specific case or member of B |
| requires | A → B | A cannot be defined without B (hard logical dependency) |
| assumes | A → B | A is optimal only when B holds; violating B weakens but doesn't break A |
| uses_distribution | A → B | A operates under or derives from distribution B |
| measures | A → B | A quantifies or diagnoses B |
| produces | A → B | Executing A yields B as direct output |
| corresponds_to | A ↔ B | A and B are structurally dual or symmetric |
| implemented_by | A → B | Abstract concept A is concretely realized through method B |

Key distinction — requires vs assumes:

  • requires: B's absence makes A undefined or incoherent (e.g. MLE requires likelihood)
  • assumes: B's violation makes A suboptimal but still computable (e.g. OLS assumes homoscedasticity)

Running locally

Prerequisites

  • Python 3.8+
  • Node.js 18+

React app (slkg)

cd slkg
npm install
npm run dev
# open http://localhost:5173

Standalone HTML visualization (no build step)

# open directly in browser (no server needed)
# "open" is the macOS command; on Linux use xdg-open, on Windows use start
open kg-project/output/kg_visualization.html

Quarterly update workflow

Do this at the start of each new quarter, after covering new material.

Step 1 — Add new nodes

Open the relevant domain file in kg-project/domains/ and append new lines at the bottom (above any # need to audit section if present).

Format:

id | Canonical Name | node_type | structural_role

Rules:

  • id must be snake_case, all lowercase, unique across all domains
  • node_type: one of Concept, Method, Model, Conceptual Organizer
  • structural_role: one of Core, Branch, Subbranch, Leaf

Example (adding to statistical_inference.txt):

variational_inference | Variational Inference | Method | Subbranch
elbo | Evidence Lower Bound | Concept | Leaf
mean_field_approximation | Mean Field Approximation | Method | Leaf

Role guidance:

  • Core — top-level organising concept for a cluster (use sparingly)
  • Branch — mid-level grouping concept
  • Subbranch — intermediate grouping, more specific than Branch
  • Leaf — concrete concept, method, or model (most new nodes will be Leaf)
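
The format and rules above can be checked mechanically before running the build. The following is a minimal sketch, not the logic in build.py: `parse_node_line` and the id pattern are hypothetical, and allowing digits inside an id is an assumption, since the README only says snake_case and lowercase.

```python
import re

NODE_TYPES = {"Concept", "Method", "Model", "Conceptual Organizer"}
ROLES = {"Core", "Branch", "Subbranch", "Leaf"}

# Assumption: lowercase letters and digits, separated by single underscores.
ID_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def parse_node_line(line):
    """Parse 'id | Canonical Name | node_type | structural_role' into a dict."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    node_id, name, node_type, role = parts
    if not ID_PATTERN.match(node_id):
        raise ValueError(f"id is not lowercase snake_case: {node_id!r}")
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node_type: {node_type!r}")
    if role not in ROLES:
        raise ValueError(f"unknown structural_role: {role!r}")
    return {
        "id": node_id,
        "canonical_name": name,
        "node_type": node_type,
        "structural_role": role,
    }
```

This sketch checks one line at a time. Uniqueness of ids across all domain files needs a second pass over every parsed node and is not shown here.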

Step 2 — Validate the new nodes

cd kg-project
python build.py --validate

Fix any errors (duplicate ids, invalid formats) before continuing. Warnings can be left for now.

Step 3 — Generate edges with Claude

Open a new conversation with Claude and send this message:

I'm updating my Statistical Learning Knowledge Graph with new nodes.
Please generate edges for the new nodes listed below.

New nodes:
[paste the lines you added in Step 1]

Use the edge schema:
- instance_of: A is a specific case of B
- requires: A cannot be defined without B
- assumes: A needs B for optimality but works without it
- uses_distribution: A uses distribution B
- measures: A quantifies B
- produces: executing A yields B
- corresponds_to: A and B are dual/symmetric
- implemented_by: abstract A is realized by method B

For each edge, provide JSON in this format:
{
  "source": {"id": "...", "canonical_name": "...", "domain": "..."},
  "target": {"id": "...", "canonical_name": "...", "domain": "..."},
  "edge_type": "...",
  "confidence": 0.0–1.0,
  "generated_by": "llm",
  "notes": ""
}

Run Agent 1 (instance_of), Agent 2 (uses_distribution), Agent 3 (requires + assumes),
Agent 4 (measures + produces), Agent 5 (corresponds_to + implemented_by),
then Agent 6 semantic review.

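Claude will return edge JSON grouped by type. Review any instance_of edges by hand before merging: rule L1-08 requires them to carry generated_by of auto or human and confidence 1.0, so edges left as "generated_by": "llm" will not satisfy that rule.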

Step 4 — Merge new edges

For each edge type that has new edges, open kg-project/edges/<type>.json and append the new edge objects to the "edges" array.

Example — adding to kg-project/edges/requires.json:

{
  "edge_type": "requires",
  "description": "...",
  "edges": [
    ...existing edges...,
    {
      "source": {"id": "variational_inference", "canonical_name": "Variational Inference", "domain": "Statistical Inference"},
      "target": {"id": "posterior_distribution", "canonical_name": "Posterior Distribution", "domain": "Statistical Inference"},
      "edge_type": "requires",
      "confidence": 0.95,
      "generated_by": "llm",
      "notes": "Variational inference approximates the posterior; requires the posterior concept to be defined."
    }
  ]
}
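
If there are many edges to add, the manual append can be scripted. The sketch below is an illustration, not part of the repository: `merge_edges` is a hypothetical helper that assumes each edge file has the shape shown above (an "edge_type" string and an "edges" array) and skips any source/target pair that is already present.

```python
import json
from pathlib import Path


def merge_edges(edge_file, new_edges):
    """Append new_edges to the "edges" array of edge_file; return how many were added."""
    path = Path(edge_file)
    doc = json.loads(path.read_text(encoding="utf-8"))

    # Pairs already in the file, so the same edge is never appended twice.
    seen = {(e["source"]["id"], e["target"]["id"]) for e in doc["edges"]}

    added = 0
    for edge in new_edges:
        # Each file holds exactly one relationship type (see rule L1-07).
        if edge["edge_type"] != doc["edge_type"]:
            raise ValueError(
                f"{edge['edge_type']!r} edge does not belong in {path.name}"
            )
        key = (edge["source"]["id"], edge["target"]["id"])
        if key in seen:
            continue
        doc["edges"].append(edge)
        seen.add(key)
        added += 1

    path.write_text(
        json.dumps(doc, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return added
```

For example, `merge_edges("kg-project/edges/requires.json", new_requires_edges)` would rewrite that file in place, so commit or back up the file first. The build in Step 5 remains the real check.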

Step 5 — Rebuild the graph

cd kg-project
python build.py

If there are errors, fix them. Warnings are acceptable (low-confidence edges, borderline cases).

Expected output:

[1/5] Loading nodes ...   307+ nodes loaded, 0 errors
[2/5] Loading edges ...   497+ edges loaded
[3/5] Validating ...      All checks passed.
[4/5] Building graph ...  output/statistical_learning_kg.json
[5/5] Updating viz ...    output/kg_visualization.html
Build complete.

Step 6 — Sync to the React app

python update_site.py

This rewrites slkg/src/data/graph.ts with the new data.
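
The README does not show what the generated graph.ts looks like. As a rough sketch of how a sync step like this can work in general, the snippet below renders a graph dict as a TypeScript module. The function name `render_graph_ts`, the export name `graph`, and the header comment are all assumptions, not the actual output of update_site.py.

```python
import json


def render_graph_ts(graph):
    """Render a JSON-serialisable graph dict as TypeScript module source text."""
    # JSON object syntax is also a valid TypeScript object literal,
    # so the serialised graph can be embedded directly.
    payload = json.dumps(graph, indent=2, ensure_ascii=False)
    return (
        "// AUTO-GENERATED FILE. Do not edit by hand.\n"
        f"export const graph = {payload} as const;\n"
    )
```

Generating the file from the built JSON, rather than editing it, keeps kg-project as the single source of truth.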

Step 7 — Test locally

cd ../slkg
npm run dev
# open http://localhost:5173
# verify new concepts appear in Graph and Explore pages

Step 8 — Deploy

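# run from slkg/ (where Step 7 left you)
npm run build
# return to the repository root so kg-project changes are staged too
cd ..
git add .
git commit -m "Q[N] update: add [X] new concepts"
git push
# Vercel detects the push and auto-deploys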

Validation rules

python build.py --validate checks all of these automatically:

| Code | Rule |
|---|---|
| L1-01 | No self-loops (source ≠ target) |
| L1-02 | All domain names are valid (one of the six) |
| L1-03 | All node ids exist in domain list |
| L1-04 | Confidence in [0.0, 1.0] |
| L1-05 | Low-confidence edges (< 0.85) must have a note explaining why |
| L1-06 | generated_by is one of: auto, llm, human |
| L1-07 | edge_type field matches the filename it lives in |
| L1-08 | instance_of edges use generated_by: auto or human, confidence 1.0 |
| L1-09 | uses_distribution targets must be in Probability Distributions domain |
| L1-10 | No duplicate (source, target, edge_type) triples across all files |
| L1-11 | Same node pair cannot appear in both requires and assumes |
| L2-01 | corresponds_to edges follow alphabetical source < target convention |
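
To show how a few of these rules translate into code, here is a minimal sketch covering L1-01, L1-04, L1-05, L1-06, L1-10, L1-11, and L2-01. It is not the implementation in build.py. `validate_edges` is a hypothetical function, and two readings are assumptions: L1-11 is treated as an ordered (source, target) pair, and L2-01 compares node ids rather than canonical names.

```python
ALLOWED_GENERATORS = {"auto", "llm", "human"}


def validate_edges(edges):
    """Return a list of (rule_code, message) tuples for the rules sketched here."""
    problems = []
    seen_triples = set()
    pairs = {"requires": set(), "assumes": set()}

    for edge in edges:
        src, dst = edge["source"]["id"], edge["target"]["id"]
        kind = edge["edge_type"]
        label = f"{src} -[{kind}]-> {dst}"

        if src == dst:
            problems.append(("L1-01", f"self-loop: {label}"))

        confidence = edge["confidence"]
        if not 0.0 <= confidence <= 1.0:
            problems.append(("L1-04", f"confidence out of range: {label}"))
        elif confidence < 0.85 and not edge.get("notes", "").strip():
            problems.append(("L1-05", f"low confidence without a note: {label}"))

        if edge["generated_by"] not in ALLOWED_GENERATORS:
            problems.append(("L1-06", f"invalid generated_by: {label}"))

        if (src, dst, kind) in seen_triples:
            problems.append(("L1-10", f"duplicate edge: {label}"))
        seen_triples.add((src, dst, kind))

        if kind in pairs:
            pairs[kind].add((src, dst))

        # Assumption: the alphabetical convention applies to node ids.
        if kind == "corresponds_to" and not src < dst:
            problems.append(("L2-01", f"source should sort before target: {label}"))

    # Assumption: "same node pair" means the same ordered (source, target).
    for src, dst in sorted(pairs["requires"] & pairs["assumes"]):
        problems.append(("L1-11", f"{src} -> {dst} is in both requires and assumes"))

    return problems
```

The remaining rules (L1-02, L1-03, L1-07, L1-08, L1-09) need the node list or the file name as extra input, so they are left out of this sketch.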

Tech stack

Data layer (kg-project)

  • Python 3 — no dependencies beyond the standard library
  • build.py — validation, graph construction, visualization update
  • update_site.py — data sync bridge to React app

Presentation layer (slkg)

  • React 18 + TypeScript
  • Vite (build tool)
  • Tailwind CSS (styling)
  • D3 v7 (force simulation, loaded via CDN)
  • React Router v6 (page routing)

Pages

| Page | Route | Description |
|---|---|---|
| Graph | /graph | Interactive force-directed graph, ego (radial) view on double-click |
| Explore | /explore | Browse all concepts with domain/role/type filters |
| Concept | /concept/:id | Full detail page for a single concept, with all relationships grouped by type |
| Domains | /domains | Domain overview with node counts, role breakdowns, cross-domain edge stats |
| Learning Path | /path | BFS shortest-path finder between any two concepts |
| Compare | /compare | Side-by-side structural comparison of two concepts |
| Edge Explorer | /edges | Browse all edges by type, sortable and filterable |
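
For readers unfamiliar with the BFS shortest-path search behind the Learning Path page, here is a minimal Python sketch of the technique. It is not the app's code, which is written in TypeScript. `shortest_path` is a hypothetical function, and following edges only in their stated direction is an assumption; the app may treat edges as undirected.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over (source, target) pairs.

    Returns the list of node ids on a shortest path, or None if goal
    cannot be reached from start.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    # previous[node] records the node we arrived from; it doubles as the visited set.
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None
```

Because BFS visits nodes in order of distance from the start, the first time it reaches the goal it has found a path with the fewest edges.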

About

Interactive knowledge graph for statistical learning
