declared-md index

the indexer that crawls GitHub for declared-md profiles and aggregates them into searchable JSON.

what it does

on each run, the indexer:

searches GitHub for files named whoami.md, whois.md, and whatis.md
filters results to the three canonical locations defined in the spec
validates each file against the JSON schemas
validates owner type (user vs organization) where required
deduplicates files from the same subject by location priority
writes valid profiles to data/people.json, data/orgs.json, and data/projects.json
writes invalid files to invalid/<kind>/ with the validation errors annotated
writes aggregated statistics to data/stats.json

this is a full re-index on each run. there is no incremental cache.
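the deduplication step listed above can be sketched roughly as follows. this is a minimal illustration, not the indexer's actual code: the `Candidate` shape, the `subject` key, and the `dedupeByPriority` name are all hypothetical, and how the real indexer identifies "the same subject" is an assumption here.

```typescript
// Hypothetical shape of a crawl hit; field names are illustrative only.
interface Candidate {
  subject: string;      // assumed key identifying who or what the file describes
  priority: 1 | 2 | 3;  // canonical location priority (1 wins over 2, 2 over 3)
  path: string;         // full path of the discovered file
}

// Keep one candidate per subject, preferring the lowest priority number.
function dedupeByPriority(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    const current = best.get(c.subject);
    if (current === undefined || c.priority < current.priority) {
      best.set(c.subject, c);
    }
  }
  return Array.from(best.values());
}
```

with this sketch, a subject that publishes both a profile-repo file and a `.github/` file would be represented only by the profile-repo file.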

canonical locations

a file is indexed only if it appears in one of these three locations, in priority order:

priority	location	example
1	root of `<owner>/<owner>` repo	`GuilhermeAlbert/GuilhermeAlbert/whoami.md`
2	root of `<owner>/declared` repo	`GuilhermeAlbert/declared/whoami.md`
3	`.github/` directory of any repo	`GuilhermeAlbert/some-repo/.github/whoami.md`

files in other locations are silently skipped. whoami.md is only indexed when the repo owner is a GitHub user. whois.md is only indexed when the repo owner is a GitHub organization.
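the location and owner-type rules above could be expressed as two small functions. this is a sketch under assumptions: `classifyLocation` and `ownerTypeAllowed` are hypothetical names, comparing owner and repo names case-insensitively is an assumption, and the text states no owner-type rule for whatis.md, so the sketch accepts any owner for it. the owner type strings follow the `type` field GitHub's REST API reports for a repo owner.

```typescript
type Kind = "whoami" | "whois" | "whatis";
type OwnerType = "User" | "Organization";

// Returns the canonical location priority for a repo-relative path,
// or null when the file is not in a canonical location.
function classifyLocation(owner: string, repo: string, path: string): 1 | 2 | 3 | null {
  const parts = path.split("/");
  const atRoot = parts.length === 1;
  const repoName = repo.toLowerCase();
  if (atRoot && repoName === owner.toLowerCase()) return 1; // <owner>/<owner> repo root
  if (atRoot && repoName === "declared") return 2;          // <owner>/declared repo root
  if (parts.length === 2 && parts[0] === ".github") return 3; // .github/ of any repo
  return null;
}

// whoami.md needs a user owner, whois.md needs an organization owner.
function ownerTypeAllowed(kind: Kind, ownerType: OwnerType): boolean {
  if (kind === "whoami") return ownerType === "User";
  if (kind === "whois") return ownerType === "Organization";
  return true; // assumption: no owner-type restriction for whatis.md
}
```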

running locally

requires GITHUB_TOKEN. a personal access token with no extra scopes is sufficient.

cd index/
npm install
npm run build

# dry run to see what would be indexed (no files written)
GITHUB_TOKEN=<your-token> node dist/crawl.js --dry-run --limit 10

# index only one kind
GITHUB_TOKEN=<your-token> node dist/crawl.js --kind whoami --limit 5

# full crawl (writes data/ and invalid/)
GITHUB_TOKEN=<your-token> node dist/crawl.js

flags:

--dry-run runs the full pipeline without writing any output files. prints what would happen.
--limit N stops after N results per kind. useful for testing. set to 0 for no limit.
--kind whoami|whois|whatis crawls only one of the three kinds.
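for readers who want to see how the three flags fit together, here is a hand-rolled parser sketch. it is not the crawler's real argument handling: `parseFlags` and `CrawlOptions` are hypothetical names, and treating an omitted `--limit` as 0 (no limit) is an assumption based on the full-crawl command above.

```typescript
type CrawlKind = "whoami" | "whois" | "whatis";

interface CrawlOptions {
  dryRun: boolean;
  limit: number;          // 0 means no limit
  kind: CrawlKind | null; // null means all three kinds
}

// Parse the arguments that follow `node dist/crawl.js`.
function parseFlags(argv: string[]): CrawlOptions {
  const opts: CrawlOptions = { dryRun: false, limit: 0, kind: null };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      opts.dryRun = true;
    } else if (arg === "--limit") {
      const n = Number(argv[++i]);
      if (!isFinite(n) || Math.floor(n) !== n || n < 0) {
        throw new Error("--limit expects a non-negative integer");
      }
      opts.limit = n;
    } else if (arg === "--kind") {
      const k = argv[++i];
      if (k !== "whoami" && k !== "whois" && k !== "whatis") {
        throw new Error("--kind expects whoami, whois, or whatis");
      }
      opts.kind = k;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return opts;
}
```

in a Node script this would typically be called as `parseFlags(process.argv.slice(2))`.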

how the workflow works

the GitHub Actions workflow is in .github/workflows/crawl.yml.

it triggers on workflow_dispatch only. to run it, go to the Actions tab in the GitHub repo and dispatch the workflow manually.

inputs:

dry_run (boolean) -- run without committing
limit (string) -- max results per kind, 0 for no limit
kind (string) -- leave empty to crawl all three kinds

the workflow installs dependencies, builds dist/, runs the crawler, and commits any changes to data/ and invalid/. it skips the commit step if there are no changes.

a daily schedule is commented out in the workflow file. uncomment it to enable automatic nightly runs.

reading data files

data/people.json:

{
  "version": "1.0",
  "indexed_at": "2026-04-27T00:00:00Z",
  "count": 42,
  "profiles": [
    {
      "handle": "guilhermealbert",
      "name": "Guilherme Albert",
      "headline": "...",
      "frontmatter": { ... },
      "body_preview": "first 200 chars of the body",
      "source": {
        "repo": "GuilhermeAlbert/GuilhermeAlbert",
        "path": "whoami.md",
        "url": "https://github.com/GuilhermeAlbert/GuilhermeAlbert/blob/main/whoami.md",
        "raw_url": "https://raw.githubusercontent.com/GuilhermeAlbert/GuilhermeAlbert/main/whoami.md",
        "location_type": "profile_repo"
      },
      "indexed_at": "2026-04-27T00:00:00Z"
    }
  ]
}

data/orgs.json and data/projects.json follow the same schema.
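a consumer could type the file like this. the interfaces below are inferred from the example above rather than taken from the spec, so treat the field types as assumptions (in particular `frontmatter`, whose contents are elided in the example). `findByHandle` is a hypothetical helper.

```typescript
interface ProfileSource {
  repo: string;
  path: string;
  url: string;
  raw_url: string;
  location_type: string;
}

interface IndexedProfile {
  handle: string;
  name: string;
  headline: string;
  frontmatter: Record<string, unknown>; // contents not shown in the example
  body_preview: string;
  source: ProfileSource;
  indexed_at: string; // ISO 8601 timestamp
}

interface IndexFile {
  version: string;
  indexed_at: string;
  count: number;
  profiles: IndexedProfile[];
}

// Look up one profile by handle; handles in the example are lowercase.
function findByHandle(index: IndexFile, handle: string): IndexedProfile | undefined {
  const wanted = handle.toLowerCase();
  for (const profile of index.profiles) {
    if (profile.handle === wanted) return profile;
  }
  return undefined;
}
```

the file contents can be loaded with `JSON.parse(text) as IndexFile` after reading data/people.json by whatever means the consumer prefers.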

data/stats.json contains aggregated counts: totals, by_location_type, by_country, top_tags, and top_languages.
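as an illustration of how a count such as by_location_type could be derived from the profile entries, here is a small sketch. the exact shape of data/stats.json is not documented here, so the returned object is an assumption, and `countByLocationType` is a hypothetical name.

```typescript
// Only the field this sketch needs; real entries carry more.
interface HasLocationType {
  source: { location_type: string };
}

// Tally profiles per source.location_type value.
function countByLocationType(profiles: HasLocationType[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const profile of profiles) {
    const key = profile.source.location_type;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```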

debugging invalid files

files that were discovered but failed validation are stored in invalid/<kind>/. each file is the original content with an HTML comment at the top:

<!--
Source: owner/repo/path
Discovered: 2026-04-27T00:00:00Z
Validation errors:
  - handle: must be 2-39 characters, lowercase letters, numbers, and hyphens only
  - links.github: must be a valid GitHub URL (e.g. https://github.com/username)
-->

[original file content here]

these files are committed to the repo so the history of validation failures is visible over time.
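when debugging many invalid files at once, it can help to pull the annotation out programmatically. the sketch below assumes the header always has exactly the layout shown above. `parseInvalidFile` and `InvalidFile` are hypothetical names, not part of the indexer.

```typescript
interface InvalidFile {
  source: string;     // value of the "Source:" line
  discovered: string; // value of the "Discovered:" line
  errors: string[];   // one entry per "- ..." line
  body: string;       // original file content after the comment
}

// Split an invalid/<kind>/ file into its annotation and original content.
// Returns null when the text does not start with an HTML comment.
function parseInvalidFile(text: string): InvalidFile | null {
  const end = text.indexOf("-->");
  if (text.slice(0, 4) !== "<!--" || end === -1) return null;

  const result: InvalidFile = {
    source: "",
    discovered: "",
    errors: [],
    body: text.slice(end + 3).replace(/^\s+/, ""),
  };
  for (const line of text.slice(4, end).split("\n")) {
    const trimmed = line.trim();
    if (trimmed.indexOf("Source:") === 0) {
      result.source = trimmed.slice("Source:".length).trim();
    } else if (trimmed.indexOf("Discovered:") === 0) {
      result.discovered = trimmed.slice("Discovered:".length).trim();
    } else if (trimmed.indexOf("- ") === 0) {
      result.errors.push(trimmed.slice(2));
    }
  }
  return result;
}
```

one limitation of this sketch: an error message that itself contains `-->` would end the header early.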

re-indexing policy

the indexer is stateless. data/ and invalid/ are derived entirely from what is currently on GitHub, not from previous runs. to recover from a bad or partial run, run the indexer again: the output is rebuilt from the current state of the public repos it finds.

if a previously valid profile is updated to be invalid, it moves from data/ to invalid/ on the next run.

note on reference profiles

the maintainer profile (guilhermealbert), studio profile (treblahq), and project profile (declared-md) are in declared-md/reference/ for documentation purposes only. that path is not a canonical location.

to have these profiles appear in the index, publish them to their canonical locations:

GuilhermeAlbert/GuilhermeAlbert/whoami.md (profile repo) or GuilhermeAlbert/declared/whoami.md (declared repo)
treblahq/treblahq/whois.md (profile repo for the org) or treblahq/declared/whois.md
declared-md/declared/whatis.md or .github/whatis.md in any declared-md org repo

how validation works

the indexer re-implements validation using AJV directly against the schemas in spec/schemas/. it does not call the published declared-md npm package. this keeps the indexer self-contained and avoids a runtime dependency on a CLI binary.

schemas are copied from spec/schemas/ to src/schemas/ at build time. src/schemas/ is gitignored.

development

npm install          # install dependencies
npm run build        # compile to dist/crawl.js
npm test             # run tests (no API calls, all mocked)
npm run test:cov     # run tests with coverage report

tests live in tests/ and use vitest. all GitHub API calls are mocked. no token is needed for tests.
