the indexer crawls GitHub for declared-md profiles and aggregates them into searchable JSON.
on each run, the indexer:
- searches GitHub for files named `whoami.md`, `whois.md`, and `whatis.md`
- filters results to the three canonical locations defined in the spec
- validates each file against the JSON schemas
- validates owner type (user vs organization) where required
- deduplicates files from the same subject by location priority
- writes valid profiles to `data/people.json`, `data/orgs.json`, and `data/projects.json`
- writes invalid files to `invalid/<kind>/` with the validation errors annotated
- writes aggregated statistics to `data/stats.json`
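the deduplication step could look roughly like the sketch below. it is not the indexer's actual code: the `Candidate` shape, its field names, and the way a subject key is derived are assumptions made for illustration. the only rule taken from this README is that a lower priority number wins.

```typescript
// Hypothetical shape for a discovered file; all field names are assumptions.
interface Candidate {
  subject: string; // assumed: lowercased repo owner, e.g. "guilhermealbert"
  kind: "whoami" | "whois" | "whatis";
  path: string; // "<owner>/<repo>/<path inside repo>"
  priority: 1 | 2 | 3; // 1 = highest-priority location
}

// Keep one candidate per (kind, subject): the one with the lowest priority number.
// When two candidates share a priority, the first one seen is kept.
function dedupeByPriority(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    const key = `${c.kind}:${c.subject}`;
    const current = best.get(key);
    if (current === undefined || c.priority < current.priority) {
      best.set(key, c);
    }
  }
  return Array.from(best.values());
}
```

a `Map` keyed by kind and subject keeps the pass linear in the number of candidates and makes the result independent of the order in which search results arrive, apart from ties.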
this is a full re-index on each run. there is no incremental cache.
a file is indexed only if it appears in one of these three locations, in priority order:
| priority | location | example |
|---|---|---|
| 1 | root of `<owner>/<owner>` repo | `GuilhermeAlbert/GuilhermeAlbert/whoami.md` |
| 2 | root of `<owner>/declared` repo | `GuilhermeAlbert/declared/whoami.md` |
| 3 | `.github/` directory of any repo | `GuilhermeAlbert/some-repo/.github/whoami.md` |
files in other locations are silently skipped. whoami.md is only indexed when the repo owner is a GitHub user. whois.md is only indexed when the repo owner is a GitHub organization.
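as a rough illustration of these rules, the sketch below maps a path to its priority and checks the owner-type restriction. the function names and the `declared_repo` and `dot_github` labels are hypothetical (only `profile_repo` appears in the output format), the input is assumed to be already filtered to the three file names, and treating `whatis.md` as unrestricted is an assumption, since this README only states rules for `whoami.md` and `whois.md`.

```typescript
// "profile_repo" appears in the index output; the other two labels are assumptions.
type LocationType = "profile_repo" | "declared_repo" | "dot_github";
// GitHub's API reports repository owners as "User" or "Organization".
type OwnerType = "User" | "Organization";

interface CanonicalLocation {
  priority: 1 | 2 | 3;
  locationType: LocationType;
}

// fullPath looks like "<owner>/<repo>/<path inside repo>".
// Returns null for any path outside the three canonical locations.
function classifyLocation(fullPath: string): CanonicalLocation | null {
  const parts = fullPath.split("/");
  if (parts.length < 3) return null;
  const [owner, repo, ...rest] = parts;
  const isRootFile = rest.length === 1;
  // GitHub owner and repo names are case-insensitive, so compare lowercased.
  if (isRootFile && repo.toLowerCase() === owner.toLowerCase()) {
    return { priority: 1, locationType: "profile_repo" };
  }
  if (isRootFile && repo.toLowerCase() === "declared") {
    return { priority: 2, locationType: "declared_repo" };
  }
  if (rest.length === 2 && rest[0] === ".github") {
    return { priority: 3, locationType: "dot_github" };
  }
  return null;
}

// whoami.md requires a user owner, whois.md an organization owner.
function ownerTypeAllowed(fileName: string, ownerType: OwnerType): boolean {
  if (fileName === "whoami.md") return ownerType === "User";
  if (fileName === "whois.md") return ownerType === "Organization";
  return true; // assumption: whatis.md has no owner-type restriction
}
```

returning `null` for everything else mirrors the "silently skipped" behavior: the caller can drop the file without recording an error.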
requires GITHUB_TOKEN. a personal access token with no extra scopes is sufficient.
```bash
cd index/
npm install
npm run build

# dry run to see what would be indexed (no files written)
GITHUB_TOKEN=<your-token> node dist/crawl.js --dry-run --limit 10

# index only one kind
GITHUB_TOKEN=<your-token> node dist/crawl.js --kind whoami --limit 5

# full crawl (writes data/ and invalid/)
GITHUB_TOKEN=<your-token> node dist/crawl.js
```

flags:

- `--dry-run` runs the full pipeline without writing any output files. prints what would happen.
- `--limit N` stops after N results per kind. useful for testing. set to 0 for no limit.
- `--kind whoami|whois|whatis` crawls only one of the three kinds.
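for readers who want to see how these three flags fit together, here is a minimal hand-rolled parser. it is a sketch, not the crawler's real argument handling: `parseFlags` and `CrawlOptions` are hypothetical names, and defaulting `limit` to 0 when the flag is omitted is an assumption.

```typescript
type Kind = "whoami" | "whois" | "whatis";

interface CrawlOptions {
  dryRun: boolean;
  limit: number; // 0 means no limit
  kind: Kind | null; // null means crawl all three kinds
}

const KINDS: readonly Kind[] = ["whoami", "whois", "whatis"];

// Parse the flag list (everything after the script name) into typed options.
function parseFlags(argv: string[]): CrawlOptions {
  // Assumption: with no flags given, nothing is limited or filtered.
  const opts: CrawlOptions = { dryRun: false, limit: 0, kind: null };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--limit") {
      const n = Number(argv[++i]); // a missing value becomes NaN and is rejected
      if (!Number.isInteger(n) || n < 0) {
        throw new Error("--limit expects a non-negative integer");
      }
      opts.limit = n;
    } else if (arg === "--kind") {
      const value = argv[++i];
      const kind = KINDS.find((k) => k === value);
      if (kind === undefined) {
        throw new Error(`--kind expects one of ${KINDS.join("|")}`);
      }
      opts.kind = kind;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return opts;
}
```

in a Node script this would typically be called as `parseFlags(process.argv.slice(2))`.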
the GitHub Actions workflow is in .github/workflows/crawl.yml.
it triggers on workflow_dispatch only. to run it, go to the Actions tab in the GitHub repo and dispatch the workflow manually.
inputs:
- `dry_run` (boolean) -- run without committing
- `limit` (string) -- max results per kind, 0 for no limit
- `kind` (string) -- leave empty to crawl all three kinds
the workflow installs dependencies, builds the dist, runs the crawler, and commits any changes to data/ and invalid/. it skips the commit step if there are no changes.
a daily schedule is commented out in the workflow file. uncomment it to enable automatic nightly runs.
data/people.json:
```json
{
  "version": "1.0",
  "indexed_at": "2026-04-27T00:00:00Z",
  "count": 42,
  "profiles": [
    {
      "handle": "guilhermealbert",
      "name": "Guilherme Albert",
      "headline": "...",
      "frontmatter": { ... },
      "body_preview": "first 200 chars of the body",
      "source": {
        "repo": "GuilhermeAlbert/GuilhermeAlbert",
        "path": "whoami.md",
        "url": "https://github.com/GuilhermeAlbert/GuilhermeAlbert/blob/main/whoami.md",
        "raw_url": "https://raw.githubusercontent.com/GuilhermeAlbert/GuilhermeAlbert/main/whoami.md",
        "location_type": "profile_repo"
      },
      "indexed_at": "2026-04-27T00:00:00Z"
    }
  ]
}
```

`data/orgs.json` and `data/projects.json` follow the same schema.
data/stats.json contains aggregated counts: totals, by_location_type, by_country, top_tags, and top_languages.
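the exact shape of `data/stats.json` is not spelled out here, so the following is only a sketch of how two of those aggregates could be derived from indexed profiles. `tally`, `top`, and `buildStats` are hypothetical helpers, and reading tags from `frontmatter.tags` as an array of strings is an assumption about the frontmatter schema.

```typescript
// Subset of the profile shape from the index files; only the fields used below.
interface IndexedProfile {
  handle: string;
  frontmatter: Record<string, unknown>;
  source: { repo: string; path: string; location_type: string };
}

// Count how often each key occurs.
function tally(keys: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const k of keys) counts[k] = (counts[k] ?? 0) + 1;
  return counts;
}

// The n most frequent keys, highest count first, ties broken alphabetically.
function top(counts: Record<string, number>, n: number): Array<{ value: string; count: number }> {
  return Object.keys(counts)
    .map((value) => ({ value, count: counts[value] }))
    .sort((a, b) => b.count - a.count || a.value.localeCompare(b.value))
    .slice(0, n);
}

function buildStats(profiles: IndexedProfile[]) {
  const tags: string[] = [];
  for (const p of profiles) {
    const raw = p.frontmatter["tags"]; // assumption: tags is a string[] in frontmatter
    if (Array.isArray(raw)) {
      for (const t of raw) if (typeof t === "string") tags.push(t);
    }
  }
  return {
    by_location_type: tally(profiles.map((p) => p.source.location_type)),
    top_tags: top(tally(tags), 10), // the cutoff of 10 is arbitrary
  };
}
```

`by_country` and `top_languages` would follow the same tally-then-rank pattern over whichever frontmatter fields hold those values.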
files that were discovered but failed validation are stored in `invalid/<kind>/`. each file is the original content with an HTML comment at the top:

```html
<!--
Source: owner/repo/path
Discovered: 2026-04-27T00:00:00Z
Validation errors:
- handle: must be 2-39 characters, lowercase letters, numbers, and hyphens only
- links.github: must be a valid GitHub URL (e.g. https://github.com/username)
-->
[original file content here]
```

these files are committed to the repo so the history of validation failures is visible over time.
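producing that header is simple string assembly. the sketch below follows the layout shown above; `annotateInvalid` and the `InvalidFile` shape are hypothetical, and neutralizing `-->` inside error messages is an added precaution, not something this README says the indexer does.

```typescript
// Hypothetical input shape; field names are assumptions.
interface InvalidFile {
  source: string; // "owner/repo/path"
  discoveredAt: string; // ISO 8601 timestamp
  errors: string[]; // one message per validation error
  content: string; // original file content, left unchanged
}

// Prepend an HTML comment listing the validation errors to the original content.
function annotateInvalid(file: InvalidFile): string {
  const header = [
    "<!--",
    `Source: ${file.source}`,
    `Discovered: ${file.discoveredAt}`,
    "Validation errors:",
    // Break up any "-->" in a message so it cannot close the comment early.
    ...file.errors.map((e) => `- ${e.replace(/-->/g, "-- >")}`),
    "-->",
  ].join("\n");
  return `${header}\n${file.content}`;
}
```

because the header is an HTML comment, the annotated file still renders as the author's original markdown.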
the indexer is stateless. data/ and invalid/ are derived from what is currently on GitHub. if you need to revert, run again: the index will reflect the current state of all public repos.
if a previously valid profile is updated to be invalid, it moves from data/ to invalid/ on the next run.
the maintainer profile (guilhermealbert), studio profile (treblahq), and project profile (declared-md) are in declared-md/reference/ for documentation purposes only. that path is not a canonical location.
to have these profiles appear in the index, publish them to their canonical locations:
- `GuilhermeAlbert/GuilhermeAlbert/whoami.md` (profile repo) or `GuilhermeAlbert/declared/whoami.md` (declared repo)
- `treblahq/treblahq/whois.md` (profile repo for the org) or `treblahq/declared/whois.md`
- `declared-md/declared/whatis.md` or `.github/whatis.md` in any declared-md org repo
the indexer re-implements validation using AJV directly against the schemas in spec/schemas/. it does not call the published declared-md npm package. this keeps the indexer self-contained and avoids a runtime dependency on a CLI binary.
schemas are copied from spec/schemas/ to src/schemas/ at build time. src/schemas/ is gitignored.
```bash
npm install        # install dependencies
npm run build      # compile to dist/crawl.js
npm test           # run tests (no API calls, all mocked)
npm run test:cov   # run tests with coverage report
```

tests live in `tests/` and use vitest. all GitHub API calls are mocked. no token is needed for tests.