HTML Parsing

Ferran Buireu edited this page Jun 13, 2026 · 2 revisions

Once the contributions HTML is fetched (see Fetching Contributions), ContribKit extracts the data with a handful of regexes over the rendered page. The parser lives in infrastructure/github/github-html-contributions-repository.ts.

This is deliberately the only place that knows GitHub's HTML structure. If GitHub changes the markup, only these patterns need updating.


What the page contains

GitHub renders each day as a <td> carrying data attributes, and exposes the exact count through a separate <tool-tip> element linked by id:

<td class="ContributionCalendar-day" id="contribution-day-component-1-2"
    data-date="2024-01-02" data-level="2"></td>
...
<tool-tip for="contribution-day-component-1-2">4 contributions on January 2nd.</tool-tip>

The regexes

| Pattern | Captures |
| --- | --- |
| TD_REGEX | each contribution-day <td>'s attribute string |
| DATE_REGEX | data-date="YYYY-MM-DD" |
| LEVEL_REGEX | data-level="0..4" |
| ID_REGEX | the <td>'s id |
| TIP_REGEX | each <tool-tip for="…">N → maps id → exact count |
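
As a rough illustration of how patterns like these could be applied, here is a minimal sketch. The regex bodies and the helper names extractCells, extractTooltipCounts, and RawCell are assumptions made for this example; the actual definitions in github-html-contributions-repository.ts may differ.

```typescript
// Hypothetical patterns; the real definitions in ContribKit may differ.
const TD_REGEX = /<td\b([^>]*class="[^"]*\bContributionCalendar-day\b[^"]*"[^>]*)>/g;
const DATE_REGEX = /\bdata-date="(\d{4}-\d{2}-\d{2})"/;
const LEVEL_REGEX = /\bdata-level="(\d+)"/;
const ID_REGEX = /(?:^|\s)id="([^"]+)"/;
const TIP_REGEX = /<tool-tip\b[^>]*\sfor="([^"]+)"[^>]*>\s*(\d+)/g;

// Assumed intermediate shape for a cell before counts are attached.
interface RawCell {
  date: string;
  level: number;
  id: string | null;
}

// Collect every contribution-day cell that has both a date and a level.
function extractCells(html: string): RawCell[] {
  const cells: RawCell[] = [];
  TD_REGEX.lastIndex = 0; // global regexes keep their position between calls
  let td: RegExpExecArray | null;
  while ((td = TD_REGEX.exec(html)) !== null) {
    const attrs = td[1];
    const date = DATE_REGEX.exec(attrs);
    const level = LEVEL_REGEX.exec(attrs);
    const id = ID_REGEX.exec(attrs);
    if (date === null || level === null) continue; // incomplete cell: skip it
    cells.push({
      date: date[1],
      level: parseInt(level[1], 10),
      id: id === null ? null : id[1],
    });
  }
  return cells;
}

// Map each tooltip's `for` id to the leading number in its text.
function extractTooltipCounts(html: string): Map<string, number> {
  const counts = new Map<string, number>();
  TIP_REGEX.lastIndex = 0;
  let tip: RegExpExecArray | null;
  while ((tip = TIP_REGEX.exec(html)) !== null) {
    counts.set(tip[1], parseInt(tip[2], 10));
  }
  return counts;
}
```

In this sketch a tooltip whose text does not start with a digit produces no map entry, so the matching day would end up without an exact count.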

The two-pass parse

  1. Cells: iterate every contribution-day <td>, pulling date, level, and id. A cell is kept only when it has both a date and a level.
  2. Tooltips: iterate every <tool-tip> and build a Map<id, count>.
  3. Enrich: for each day, attach the exact count by looking up its id in the map; level is run through clampLevel to guarantee it's in the range 0..4. Days whose id isn't in the map (or that have no id) get count: null.
  4. Total: if any counts were found, sum them; otherwise total is null.

The result is { days, total }, where each day is { date, level, count }.
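
The enrich and total steps can be sketched as follows, starting from cells and a tooltip map that have already been extracted. The type names, the enrichDays helper, and the body of clampLevel are assumptions for illustration; only the name clampLevel and the { days, total } shape come from the description above.

```typescript
type Level = 0 | 1 | 2 | 3 | 4;

// Assumed intermediate shape for a cell before counts are attached.
interface RawCell {
  date: string;
  level: number;
  id: string | null;
}

interface Day {
  date: string;
  level: Level;
  count: number | null;
}

interface Contributions {
  days: Day[];
  total: number | null;
}

// Assumed behavior: round, then force the value into 0..4 (NaN becomes 0).
function clampLevel(level: number): Level {
  if (isNaN(level)) return 0;
  return Math.min(4, Math.max(0, Math.round(level))) as Level;
}

// Attach exact counts by id and sum whatever counts were found.
function enrichDays(cells: RawCell[], counts: Map<string, number>): Contributions {
  const days: Day[] = [];
  let total: number | null = null;
  for (const cell of cells) {
    const found = cell.id === null ? undefined : counts.get(cell.id);
    const count = found === undefined ? null : found;
    if (count !== null) {
      total = (total === null ? 0 : total) + count; // first hit turns null into a sum
    }
    days.push({ date: cell.date, level: clampLevel(cell.level), count: count });
  }
  return { days: days, total: total };
}
```

Keeping total as null when no tooltip matched, rather than 0, lets callers tell "no exact counts available" apart from "zero contributions".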


Failure behavior

If parsing produces zero days, the repository returns parse("Could not parse contributions") rather than an empty (and misleading) calendar. That typically means GitHub changed the page structure; see Troubleshooting.
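
One way to picture this guard is a result type that makes the failure explicit. This is only a sketch: the ParseError shape, the Result type, the rejectEmpty helper, and the body of parse are all assumptions, since the text shows only the call parse("Could not parse contributions").

```typescript
// Assumed error shape; the page only shows the call parse("...").
interface ParseError {
  kind: "parse";
  message: string;
}

// Hypothetical result wrapper: either a value or a parse error.
type Result<T> =
  | { ok: true; value: T }
  | { ok: false; error: ParseError };

// Hypothetical stand-in for the error factory named above.
function parse(message: string): ParseError {
  return { kind: "parse", message: message };
}

// Turn an empty parse into an explicit failure instead of an empty calendar.
function rejectEmpty<T extends { days: unknown[] }>(parsed: T): Result<T> {
  if (parsed.days.length === 0) {
    return { ok: false, error: parse("Could not parse contributions") };
  }
  return { ok: true, value: parsed };
}
```

A caller would check ok before rendering, so a markup change on GitHub's side surfaces as an error rather than as a blank calendar.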

