declared-md index

the indexer that crawls GitHub for declared-md profiles and aggregates them into searchable JSON.


what it does

on each run, the indexer:

  1. searches GitHub for files named whoami.md, whois.md, and whatis.md
  2. filters results to the three canonical locations defined in the spec
  3. validates each file against the JSON schemas
  4. validates owner type (user vs organization) where required
  5. deduplicates files from the same subject, keeping the one in the highest-priority location
  6. writes valid profiles to data/people.json, data/orgs.json, and data/projects.json
  7. writes invalid files to invalid/<kind>/ with the validation errors annotated
  8. writes aggregated statistics to data/stats.json

this is a full re-index on each run. there is no incremental cache.
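
step 5 can be sketched as below. this is a minimal sketch, not the indexer's actual code: the Candidate shape, the subject key, and the function name are all hypothetical. it assumes priority 1 is the highest, matching the priority order listed under canonical locations.

```typescript
// hypothetical shape of a discovered file; field names are assumptions
interface Candidate {
  subject: string;  // assumed dedup key, e.g. the lowercased owner login
  priority: number; // 1 = highest-priority location
  path: string;
}

// keep one candidate per subject: the one with the lowest priority number
function dedupeByPriority(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    const current = best.get(c.subject);
    if (current === undefined || c.priority < current.priority) {
      best.set(c.subject, c);
    }
  }
  return Array.from(best.values());
}
```

because the run is a full re-index, a single pass over all discovered files is enough; nothing has to be merged with results from a previous run.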


canonical locations

a file is indexed only if it appears in one of these three locations, in priority order:

| priority | location | example |
|---|---|---|
| 1 | root of <owner>/<owner> repo | GuilhermeAlbert/GuilhermeAlbert/whoami.md |
| 2 | root of <owner>/declared repo | GuilhermeAlbert/declared/whoami.md |
| 3 | .github/ directory of any repo | GuilhermeAlbert/some-repo/.github/whoami.md |

files in other locations are silently skipped. whoami.md is only indexed when the repo owner is a GitHub user. whois.md is only indexed when the repo owner is a GitHub organization.
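
the location filter can be sketched as a small pure function. this is an illustration under stated assumptions, not the indexer's implementation: the function name is hypothetical, the location type names other than "profile_repo" are made up for the example, owner and repo names are compared case-insensitively, and ".github/" is taken to mean the top-level directory of the repo.

```typescript
// "profile_repo" appears in the data files; the other two names are assumed
type LocationType = "profile_repo" | "declared_repo" | "dot_github";

interface CanonicalLocation {
  type: LocationType;
  priority: 1 | 2 | 3;
}

// returns the canonical location of a file, or null if it should be skipped
function classifyLocation(
  owner: string,
  repo: string,
  path: string
): CanonicalLocation | null {
  const slash = path.lastIndexOf("/");
  const dir = slash === -1 ? "" : path.slice(0, slash);
  const o = owner.toLowerCase();
  const r = repo.toLowerCase();

  if (dir === "") {
    // file sits at the repo root
    if (r === o) return { type: "profile_repo", priority: 1 };
    if (r === "declared") return { type: "declared_repo", priority: 2 };
    return null;
  }
  if (dir === ".github") return { type: "dot_github", priority: 3 };
  return null; // any other directory is not canonical
}
```

the owner-type rule (whoami.md for users, whois.md for organizations) is a separate check and is not covered by this sketch.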


running locally

requires a GITHUB_TOKEN environment variable. a personal access token with no extra scopes is sufficient.

cd index/
npm install
npm run build

# dry run to see what would be indexed (no files written)
GITHUB_TOKEN=<your-token> node dist/crawl.js --dry-run --limit 10

# index only one kind
GITHUB_TOKEN=<your-token> node dist/crawl.js --kind whoami --limit 5

# full crawl (writes data/ and invalid/)
GITHUB_TOKEN=<your-token> node dist/crawl.js

flags:

  • --dry-run runs the full pipeline without writing any output files and prints what would be indexed.
  • --limit N stops after N results per kind. useful for testing. set to 0 for no limit.
  • --kind whoami|whois|whatis crawls only one of the three kinds.
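
one way the three flags could be parsed is sketched below. this is not the code in crawl.js: the Options shape and the parseFlags name are hypothetical, and a default of 0 (no limit) when --limit is omitted is an assumption.

```typescript
type Kind = "whoami" | "whois" | "whatis";

interface Options {
  dryRun: boolean;
  limit: number;     // 0 means no limit (assumed default)
  kind: Kind | null; // null means crawl all three kinds
}

// parse the arguments that follow the script name, e.g. process.argv.slice(2)
function parseFlags(argv: string[]): Options {
  const opts: Options = { dryRun: false, limit: 0, kind: null };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--limit") {
      const n = Number(argv[++i]); // a missing value becomes NaN and is rejected
      if (!Number.isInteger(n) || n < 0) {
        throw new Error("--limit expects a non-negative integer");
      }
      opts.limit = n;
    } else if (arg === "--kind") {
      const k = argv[++i];
      if (k !== "whoami" && k !== "whois" && k !== "whatis") {
        throw new Error("--kind expects whoami, whois, or whatis");
      }
      opts.kind = k;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return opts;
}
```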

how the workflow works

the GitHub Actions workflow is in .github/workflows/crawl.yml.

it triggers on workflow_dispatch only. to run it, go to the Actions tab in the GitHub repo and dispatch the workflow manually.

inputs:

  • dry_run (boolean) -- run without committing
  • limit (string) -- max results per kind, 0 for no limit
  • kind (string) -- leave empty to crawl all three kinds

the workflow installs dependencies, builds the dist, runs the crawler, and commits any changes to data/ and invalid/. it skips the commit step if there are no changes.

a daily schedule is commented out in the workflow file. uncomment it to enable automatic nightly runs.


reading data files

data/people.json:

{
  "version": "1.0",
  "indexed_at": "2026-04-27T00:00:00Z",
  "count": 42,
  "profiles": [
    {
      "handle": "guilhermealbert",
      "name": "Guilherme Albert",
      "headline": "...",
      "frontmatter": { ... },
      "body_preview": "first 200 chars of the body",
      "source": {
        "repo": "GuilhermeAlbert/GuilhermeAlbert",
        "path": "whoami.md",
        "url": "https://github.com/GuilhermeAlbert/GuilhermeAlbert/blob/main/whoami.md",
        "raw_url": "https://raw.githubusercontent.com/GuilhermeAlbert/GuilhermeAlbert/main/whoami.md",
        "location_type": "profile_repo"
      },
      "indexed_at": "2026-04-27T00:00:00Z"
    }
  ]
}

data/orgs.json and data/projects.json follow the same schema.
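
for consumers, the people.json shape above could be modelled with types like these. this is a sketch: the interface and function names are hypothetical, the fields mirror the example only, and frontmatter is left untyped because its contents are defined by the spec schemas. lowercasing the lookup handle assumes handles are always stored in lowercase.

```typescript
interface ProfileSource {
  repo: string;
  path: string;
  url: string;
  raw_url: string;
  location_type: string;
}

interface Profile {
  handle: string;
  name: string;
  headline: string;
  frontmatter: Record<string, unknown>;
  body_preview: string;
  source: ProfileSource;
  indexed_at: string;
}

interface IndexFile {
  version: string;
  indexed_at: string;
  count: number;
  profiles: Profile[];
}

// parse the JSON text of a data file and look up one profile by handle
function findProfile(jsonText: string, handle: string): Profile | undefined {
  const index = JSON.parse(jsonText) as IndexFile;
  const wanted = handle.toLowerCase();
  return index.profiles.filter((p) => p.handle === wanted)[0];
}
```

the cast after JSON.parse does no runtime checking; a consumer that does not trust the file should validate it before use.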

data/stats.json contains aggregated counts: totals, by_location_type, by_country, top_tags, and top_languages.


debugging invalid files

files that were discovered but failed validation are stored in invalid/<kind>/. each file is the original content with an HTML comment at the top:

<!--
Source: owner/repo/path
Discovered: 2026-04-27T00:00:00Z
Validation errors:
  - handle: must be 2-39 characters, lowercase letters, numbers, and hyphens only
  - links.github: must be a valid GitHub URL (e.g. https://github.com/username)
-->

[original file content here]

these files are committed to the repo so the history of validation failures is visible over time.
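
when triaging many invalid files, the annotation can be read back programmatically. the sketch below assumes the comment has exactly the layout shown above (LF line endings, "Source: " and "Discovered: " prefixes, errors indented as "  - "); the function and type names are hypothetical.

```typescript
interface InvalidAnnotation {
  source: string;
  discovered: string;
  errors: string[];
}

// extract the leading HTML comment written by the indexer; null if absent
function parseAnnotation(content: string): InvalidAnnotation | null {
  const match = /^<!--\n([\s\S]*?)\n-->/.exec(content);
  if (match === null) return null;

  const result: InvalidAnnotation = { source: "", discovered: "", errors: [] };
  for (const line of (match[1] ?? "").split("\n")) {
    if (line.indexOf("Source: ") === 0) {
      result.source = line.slice("Source: ".length);
    } else if (line.indexOf("Discovered: ") === 0) {
      result.discovered = line.slice("Discovered: ".length);
    } else if (line.indexOf("  - ") === 0) {
      result.errors.push(line.slice(4)); // drop the list marker
    }
  }
  return result;
}
```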


re-indexing policy

the indexer is stateless. data/ and invalid/ are derived entirely from what is currently on GitHub. to recover from a bad run, run the indexer again: the output will reflect the current state of all public repos.

if a previously valid profile is updated to be invalid, it moves from data/ to invalid/ on the next run.


note on reference profiles

the maintainer profile (guilhermealbert), studio profile (treblahq), and project profile (declared-md) are in declared-md/reference/ for documentation purposes only. that path is not a canonical location.

to have these profiles appear in the index, publish them to their canonical locations:

  • GuilhermeAlbert/GuilhermeAlbert/whoami.md (profile repo) or GuilhermeAlbert/declared/whoami.md (declared repo)
  • treblahq/treblahq/whois.md (profile repo for the org) or treblahq/declared/whois.md
  • declared-md/declared/whatis.md or .github/whatis.md in any declared-md org repo

how validation works

the indexer re-implements validation using AJV directly against the schemas in spec/schemas/. it does not call the published declared-md npm package. this keeps the indexer self-contained and avoids a runtime dependency on a CLI binary.

schemas are copied from spec/schemas/ to src/schemas/ at build time. src/schemas/ is gitignored.


development

npm install          # install dependencies
npm run build        # compile to dist/crawl.js
npm test             # run tests (no API calls, all mocked)
npm run test:cov     # run tests with coverage report

tests live in tests/ and use vitest. all GitHub API calls are mocked. no token is needed for tests.

