A static website and build pipeline that consolidates markdown-based lesson documents from multiple GitHub repositories into one searchable, browsable, AI-readable lessons library. Includes a RAG-powered chatbot, voice reader for lesson pages, and corpus gap detection. Deployed via GitHub Pages at lessons.johnboen.com.
data/repos.yml → harvest_lessons.py → [clone repos to tmp/repos/]
→ parse docs/lessons/*.md → normalize → generated JSON + exports
→ validate_lessons.py → Astro build → Pagefind index → GitHub Pages
ChatPanel.astro → POST /api/chat → FastAPI Backend
→ RAG retrieval → grounded LLM answer with citations
→ gap detection → GitHub discovery → candidate lesson extraction
- Source repos own their lessons at docs/lessons/*.md with optional YAML frontmatter
- Hub repo owns the registry (data/repos.yml), harvesting, validation, rendering, and deployment
- Generated JSON in src/content/generated/ drives all Astro pages
- Export packs in public/exports/ provide AI-readable lesson data
- FastAPI backend provides RAG chatbot, gap detection, and GitHub discovery (runs independently of the static site)
- Voice reader on lesson and about pages using the Web Speech API (no external dependencies)
- Full-text search via Pagefind (static, no backend required)
- RAG chatbot — ask questions, get grounded answers citing specific lessons
- Voice reader — browser-native text-to-speech with paragraph highlighting, section skip, and keyboard shortcuts
- Gap detection — queries the corpus can't answer create trackable gap records
- GitHub discovery — gaps produce candidate external repos, scored and ranked
- Multi-cloud deployment — AWS (Bedrock + OpenSearch), Azure (OpenAI + AI Search), GCP (Vertex AI)
- CI/CD — pytest, ruff, corpus validation; staging/production split with approval gates
Each participating repository stores lessons in docs/lessons/*.md. Each lesson is a standalone markdown document. Subdirectories are supported (e.g., docs/lessons/phase1/*.md).
Lessons may include YAML frontmatter:
---
title: My Lesson Title
summary: One-line summary
date: 2025-01-15
phase: implementation
lesson_type: architecture
status: active
tags: [python, testing, ci-cd]
---
Required (after normalization): title (can be inferred from H1 or filename).
Recommended: summary, date, tags, phase, lesson_type.
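The title fallback described above (frontmatter, then H1, then filename) can be sketched roughly as follows. This is an illustration only, not the code in harvest_lessons.py: the function names are hypothetical, the frontmatter reader handles only flat `key: value` lines, and the filename-to-title rule is an assumption.

```python
import re
from pathlib import Path


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a lesson into (frontmatter, body).

    Simplified: only flat `key: value` lines are read and values stay strings.
    A real harvester would more likely use a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            meta = {}
            for line in lines[1:end]:
                key, sep, value = line.partition(":")
                if sep and key.strip():
                    meta[key.strip()] = value.strip()
            return meta, "\n".join(lines[end + 1:])
    return {}, text  # no closing delimiter: treat everything as body


def infer_title(text: str, filename: str) -> str:
    """Pick a title: frontmatter first, then the first H1, then the filename."""
    meta, body = split_frontmatter(text)
    if meta.get("title"):
        return meta["title"]
    match = re.search(r"^#[ \t]+(.+?)[ \t]*$", body, flags=re.MULTILINE)
    if match:
        return match.group(1)
    # Assumed convention: dashes and underscores in the filename become spaces.
    stem = Path(filename).stem
    return re.sub(r"[-_]+", " ", stem).strip().title()
```

With this sketch, a lesson that has neither a frontmatter title nor an H1 still receives a usable title derived from its filename.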
- Edit data/repos.yml and add an entry:
- id: my-project
name: My Project
owner: github-username
repo: repo-name
branch: main
lessons_path: docs/lessons
project_url: https://github.com/username/repo
enabled: true
- Run npm run harvest to test
- Run npm run validate:lessons to check for issues
- Commit the updated data/repos.yml
See docs/adding-a-repo.md for the full procedure.
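As a rough illustration of how a harvester might consume registry entries like the one above, the sketch below checks each entry and derives a clone URL. It assumes the YAML has already been loaded into a list of dicts; the function name, the set of fields treated as required, and the default of `enabled: true` when the key is absent are all assumptions, not behavior confirmed by this project.

```python
# Fields treated as mandatory here are an assumption for illustration.
REQUIRED_FIELDS = ("id", "owner", "repo", "branch", "lessons_path")


def enabled_repos(entries: list[dict]) -> list[dict]:
    """Return enabled registry entries, each with a derived clone_url."""
    seen = set()
    result = []
    for entry in entries:
        missing = [name for name in REQUIRED_FIELDS if not entry.get(name)]
        if missing:
            label = entry.get("id", "<no id>")
            raise ValueError(f"{label}: missing {', '.join(missing)}")
        if entry["id"] in seen:
            raise ValueError(f"duplicate repo id: {entry['id']}")
        seen.add(entry["id"])
        if entry.get("enabled", True):  # assumed default when the key is absent
            url = f"https://github.com/{entry['owner']}/{entry['repo']}.git"
            result.append({**entry, "clone_url": url})
    return result
```

Failing fast on a malformed entry keeps a typo in the registry from silently dropping a repository from the harvest.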
- Node.js 20+
- Python 3.11+
- Git
- Ollama (for RAG chatbot, optional)
npm install
pip install -e backend[dev]
npm run dev # Astro dev server (localhost:4331)
npm run harvest # Clone repos and generate JSON
npm run validate:lessons # Validate harvested data
npm run build # Astro build
npm run index # Pagefind indexing
npm run build:full # Full pipeline: harvest → validate → corpus → build → index
npm run backend # Start FastAPI backend (localhost:8011)
npm run corpus # Build RAG corpus from lessons.json
npm run validate:corpus # Validate RAG corpus
python -m pytest tests/ # Project tests (76)
python -m pytest backend/tests/ # Backend tests (138)
ruff check backend/ # Lint
ruff format --check backend/ # Format check
npm run test:e2e:links # Link crawl (Playwright, requires running site)
npm run test:e2e:smoke # Smoke tests (Playwright)
This project uses non-default ports to allow concurrent development across multiple repos:
| Service | Port |
|---|---|
| Astro dev server | 4331 |
| FastAPI backend | 8011 |
The site deploys automatically via GitHub Actions on:
- Push to main
- Manual workflow dispatch
- Daily schedule (6:00 UTC)
The workflow runs: checkout → Python/Node setup → lint → test → harvest → validate → corpus → build → Pagefind index → deploy to GitHub Pages.
For private repos, set the LESSONS_REPO_TOKEN secret in the repository settings.
AWS, Azure, and GCP deployment workflows are available via manual dispatch. Each uses OIDC/Workload Identity Federation for keyless CI/CD auth.
| Cloud | Backend | LLM | Vector Store | Workflow |
|---|---|---|---|---|
| AWS | ECS Fargate | Bedrock (Claude 3 Haiku) | OpenSearch Serverless | deploy-aws.yml |
| Azure | Container Apps | Azure OpenAI (gpt-4o-mini) | Azure AI Search | deploy-azure.yml |
| GCP | Cloud Run | Vertex AI (Gemini 1.5 Flash) | Vertex Vector Search | deploy-gcp.yml |
Infrastructure templates: infra/aws/cloudformation.yml, infra/azure/main.bicep, infra/gcp/deploy.sh.
Active reference documents in docs/:
| Document | Purpose |
|---|---|
| architecture.md | System architecture, V1/V2 data flow, repo treatment |
| PDR.md | V1 product design requirements |
| PDR_V2.md | V2 product design requirements (RAG, gaps, discovery, cloud) |
| project_walkthrough.md | End-to-end system walkthrough (rendered as the About page) |
| lesson-schema.md | Frontmatter schema, controlled vocabularies, ID rules |
| lesson-template.md | Markdown template for writing new lessons |
| adding-a-repo.md | How to add a source repository to the registry |
Completed plans, historical reviews, and superseded specs are in docs/archive/.
After build, the following AI-readable exports are available at /exports/:
- lessons-pack.json: full normalized lesson records
- lessons-index.json: compact records (id, title, repo, summary, tags, url)
- lessons-pack.md: all lessons in one markdown document, grouped by repo
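The relationship between the full pack and the compact index can be sketched as a simple projection. This is a minimal example under assumptions: the six field names come from the list above, but the function names and the idea that full records are plain dicts with extra keys (such as a hypothetical `content` field) are illustrative.

```python
import json

# Field names taken from the compact-record description above.
INDEX_FIELDS = ("id", "title", "repo", "summary", "tags", "url")


def to_index_record(lesson: dict) -> dict:
    """Keep only the compact fields; absent fields become None."""
    return {name: lesson.get(name) for name in INDEX_FIELDS}


def build_index(lessons: list[dict]) -> str:
    """Serialize compact records as a JSON array."""
    return json.dumps([to_index_record(item) for item in lessons], indent=2)
```

A consumer with a small context budget could load the index first and fetch full records from the pack only for the lessons it needs.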
Validation uses two severity levels:
- ERROR (build fails): missing registry, duplicate IDs, empty content, invalid JSON
- WARNING (build continues): missing summary/date/tags, unknown lesson types, short content
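The two severity levels can be sketched as a report that collects errors and warnings separately, where only errors fail the build. This is a simplified illustration, not the logic of validate_lessons.py: the `Report` class, the `content` field name, and the `MIN_CONTENT_CHARS` threshold are hypothetical, and only a subset of the listed checks is shown.

```python
from dataclasses import dataclass, field

MIN_CONTENT_CHARS = 200  # hypothetical threshold for "short content"


@dataclass
class Report:
    errors: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        """The build may continue only when there are no errors."""
        return not self.errors


def validate(lessons: list[dict]) -> Report:
    report = Report()
    seen = set()
    for lesson in lessons:
        lesson_id = lesson.get("id", "<no id>")
        # ERROR level: duplicate IDs and empty content.
        if lesson_id in seen:
            report.errors.append(f"{lesson_id}: duplicate ID")
        seen.add(lesson_id)
        content = (lesson.get("content") or "").strip()
        if not content:
            report.errors.append(f"{lesson_id}: empty content")
        elif len(content) < MIN_CONTENT_CHARS:
            report.warnings.append(f"{lesson_id}: short content")
        # WARNING level: missing recommended metadata.
        for name in ("summary", "date", "tags"):
            if not lesson.get(name):
                report.warnings.append(f"{lesson_id}: missing {name}")
    return report
```

A caller would typically print both lists and exit with a non-zero status only when `report.ok` is false.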