ds17f/annotatedDead

Annotated Grateful Dead Lyrics — Faithful Mirror

A self-contained, offline, faithful preservation of David Dodd's The Annotated Grateful Dead Lyrics — the 1990s UC Santa Cruz site (artsites.ucsc.edu/GDead/agdl/), recovered from the Internet Archive and made fully browsable on its own, with the period HTML preserved byte-for-byte.

Live site: https://annotated.thedeadly.app/

The original site is frozen/offline. This project rebuilds it from a single archive.org snapshot (timestamp 20230806233010) and fixes the links so it works without the dead live domain.

The live successors of the project, for reference: Grateful Dead Archive Online and the book The Complete Annotated Grateful Dead Lyrics.


Quick start

The only prerequisite is uv.

make all          # fetch the archive, build the full site, audit it, and serve

On a fresh clone the raw archive (mirror/) is not present: it is not committed to this repo (see Status & self-hosting below). make all detects that it is missing and runs the one-time Wayback crawl (make mirror, roughly 30-45 minutes) to recreate it, then builds dist/, audits link health, and serves the full site at http://localhost:8000. The crawl happens only once: the mirror is cached on disk, so later make runs skip it. uv run provisions dependencies on first use, so a separate make install isn't required.

Individual targets:

make dist         # build the full site (annotations + lyrics) into dist/
make safe         # build dist/, then strip lyrics for safe public hosting
make serve-dist   # serve dist/ at http://localhost:8000
make audit        # report link health of dist/
make mirror       # re-crawl the archive into mirror/ (~30-45 min; only to refresh)

To preview the safe (annotation-only) build locally, run make safe then make serve-dist. If a server is already running, make safe rewrites dist/ in place and a browser refresh shows the stripped version — no restart needed.

You don't strictly need a server — dist/ is plain static HTML. After a build you can just open dist/index.html (or dist/gdhome.html) with a file:// URL in a browser and click through.


How it works: source of truth → build

The core design decision is to decouple downloading from converting:

archive.org ──mirror──► mirror/ ──build──► dist/ ──serve/audit──► browser
              (raw,                (link-fixed,
               byte-exact,          generated,
               immutable)           disposable)
  • mirror/ is the source of truth: the original HTML and image assets, saved exactly as the archive served them. No conversion, no link rewriting. Once you have it, build outputs can be regenerated forever with zero network. It is gitignored / not committed to this public repo (it contains the copyrighted lyrics); recreate it with make mirror.
  • dist/ is a build artifact: mirror/ plus link fixes. It's gitignored and rebuilt by make dist.

Keeping these separate is the whole point: downloading once into an immutable mirror/ means the browsable output can be regenerated any number of ways without ever re-hitting archive.org.


Status & self-hosting: full vs. safe builds

David Dodd generously gave permission to host his annotations and essays. He did not (and could not) license the underlying song lyrics, which are separately copyrighted. So this project distinguishes two builds:

| Build | Command | Contains | Who it's for |
|-------|---------|----------|--------------|
| Full | make dist | annotations and lyrics | local self-hosters with a lawful source |
| Safe | make safe | annotations only; lyrics replaced with a link to dead.net/songs | the public site we deploy |

make safe runs the normal build and then a standalone pass (scripts/safe_build.py) that strips each song's verbatim lyric block — the <blockquote> between the song's credit line and the first annotation anchor — and drops a link to the official lyrics source in its place. Everything else is preserved: the essays, and the public-domain poems, dictionary entries, and reader correspondence quoted within the annotations. Short lyric fragments quoted inline for commentary in the essays are left intact (permitted annotation / fair use); only the full per-song lyric reproductions are removed. The pass is idempotent and byte-preserving, and it never deletes a byte of the annotation section even on pages with malformed 1990s markup.
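The strip pass described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/safe_build.py: the anchor pattern (`<a name=`), the notice markup, and the function name `strip_lyrics` are all assumptions, and the real pages may mark the credit line and annotation section differently.

```python
import re

# Hypothetical replacement notice; the real pass may emit different markup.
NOTICE = '<p><a href="https://dead.net/songs">Lyrics: see dead.net/songs</a></p>'

# Assumption: the first annotation anchor looks like <a name="...">.
ANCHOR_RE = re.compile(r"<a\s+name=", re.IGNORECASE)
BLOCKQUOTE_RE = re.compile(r"<blockquote>.*?</blockquote>", re.IGNORECASE | re.DOTALL)


def strip_lyrics(html: str) -> str:
    """Replace the first <blockquote> that closes before the first annotation anchor."""
    if NOTICE in html:
        return html  # already stripped: keeps the pass idempotent
    anchor = ANCHOR_RE.search(html)
    if anchor is None:
        return html  # no annotation section found: leave the page untouched
    # endpos limits the search so nothing at or after the anchor can match.
    block = BLOCKQUOTE_RE.search(html, 0, anchor.start())
    if block is None:
        return html
    return html[:block.start()] + NOTICE + html[block.end():]
```

Bounding the search at the first anchor is what keeps the annotation section safe in this sketch: a quoted poem in a later `<blockquote>` can never match, even if the lyric block itself is missing or malformed.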

Because the full build embeds those lyrics, the raw mirror/ is kept out of this public repository entirely — it is gitignored, and CI sources it from a separate private mirror repo (via the MIRROR_DEPLOY_KEY secret) purely to produce the safe deploy. Nothing public, here or in CI artifacts, contains the verbatim lyrics.

The public site at https://annotated.thedeadly.app/ is the safe build — CI runs make safe before deploying. To build the full site locally, run make mirror once to fetch a lawful copy of the original HTML from the Internet Archive, then make dist.


The pipeline, pass by pass (with real results)

1. make mirror — the raw crawl (scripts/mirror.py)

A breadth-first crawler starting at gdhome.html, following internal links.

Key techniques learned along the way:

  • The Wayback id_ form. Fetching https://web.archive.org/web/20230806233010id_/<original-url> returns the original bytes with no Wayback toolbar or rewritten links injected — exactly what a faithful mirror needs.
  • Link canonicalization. The source links to itself inconsistently: artsites.ucsc.edu/GDead/agdl/… and arts.ucsc.edu/gdead/agdl/… (different host, different case). archive.org normalizes these, so every internal link is reduced to one canonical relative path and fetched/stored exactly once.
  • Throttle resilience. archive.org drops connections under burst load. The first full run got cut off; the crawler now retries connection failures with exponential backoff (make mirror-retry re-queues anything still failed), while letting real 404s fail fast. State is saved every few pages, so the crawl is interruptible and resumable.
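As a rough sketch of the three techniques above (not the actual code in scripts/mirror.py), the helpers below build the id_ URL, reduce both host/case spellings to one relative path, and retry connection failures with exponential backoff. The function names, the retry count, and the delays are assumptions for illustration; `fetch` is any caller-supplied function that raises `ConnectionError` when throttled.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlsplit

SNAPSHOT = "20230806233010"  # snapshot timestamp quoted in this README
AGDL_HOSTS = {"artsites.ucsc.edu", "arts.ucsc.edu"}
AGDL_PREFIX = "/gdead/agdl/"  # compared case-insensitively


def wayback_raw_url(original_url: str, timestamp: str = SNAPSHOT) -> str:
    """Build the id_ form, which serves the original bytes without the toolbar."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"


def canonical_path(url: str) -> Optional[str]:
    """Map an absolute URL on either host to one relative path, else None."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in AGDL_HOSTS:
        return None
    if not parts.path.lower().startswith(AGDL_PREFIX):
        return None
    return parts.path[len(AGDL_PREFIX):]


def fetch_with_backoff(fetch: Callable[[str], bytes], url: str, retries: int = 4,
                       base_delay: float = 1.0, sleep=time.sleep) -> bytes:
    """Retry dropped connections with doubling delays; other errors propagate at once."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Only `ConnectionError` is retried here, so a fetcher that signals a real 404 with a different exception fails fast, matching the behaviour described above.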

Results:

| Run | Succeeded | Failed | Notes |
|-----|-----------|--------|-------|
| Pass 1 (no backoff) | 158 | 108 | only 3 real 404s; the rest were throttling |
| Pass 2 (--retry-failed, with backoff) | 308 | 15 | the 15 are all non-content (below) |

Final mirror: ~308 resources, ~5.6 MB — 198 HTML pages + 110 images/assets. A cross-check against the project's known page list found every catalogued page present, plus 3 the previous attempt had missed.

The 15 "failures" are all unrecoverable-by-design: typo'd/dead links whose real target was mirrored under the correct name (mexicali.html → mex.html, etc.), parse fragments from malformed source HTML, and external domains.

2. make dist — the cleanup build (scripts/build_site.py)

Reads mirror/, writes dist/. HTML is edited via a lossless latin-1 round-trip so only href/src URL values change — every other byte of the original markup is preserved. Each fix is an explicit, counted pass:

| Pass | What it does | Rewrites |
|------|--------------|----------|
| 0 | repair malformed source tags (<a href="a href="deal.html" → deal.html) | 5 |
| 1 | abs-agdl → relative (http://…/agdl/david.html → david.html) | 408 |
| 2 | typo'd internal link → real page (mexicali.html → mex.html) | 10 |
| 3 | root-absolute → relative (/scarlet.html → scarlet.html) | 3 |
| 4 | bare domain → add scheme (www.gdhour.com → http://www.gdhour.com, still live) | 1 |
| 5 | known-dead destination → link-gone.html (alternatives page) | 3 |
| 6 | repair broken #anchor typos (#workingmans → #workingman) | 17 |

Pass 0 fixes five specific broken-link defects in the source — a missing quote, a missing space, a pasted-twice href — each with an unambiguous intended target, applied as exact literal replacements.

The biggest win is pass 1: 407 cross-page links pointed at the dead live domain even though their targets sit right in the mirror.
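The latin-1 round-trip works because every byte value maps to exactly one code point and back, so decoding and re-encoding cannot alter bytes the pass does not touch. A minimal sketch of one counted pass (the abs-agdl → relative rewrite) might look like this; the regexes and the name `abs_to_relative` are illustrative assumptions, not the contents of scripts/build_site.py, and only double-quoted attributes are handled.

```python
import re
from typing import Tuple

# Assumption: attributes are double-quoted; the real build may handle more forms.
ATTR_RE = re.compile(r'((?:href|src)\s*=\s*")([^"]*)(")', re.IGNORECASE)
ABS_PREFIX_RE = re.compile(
    r"^https?://(?:artsites|arts)\.ucsc\.edu/gdead/agdl/", re.IGNORECASE
)


def abs_to_relative(raw: bytes) -> Tuple[bytes, int]:
    """Rewrite absolute agdl URLs to relative ones; return new bytes and a count."""
    text = raw.decode("latin-1")  # lossless: one byte <-> one code point
    count = 0

    def fix(match):
        nonlocal count
        new_url, n = ABS_PREFIX_RE.subn("", match.group(2))
        count += n
        # Re-emit the attribute name and quotes exactly as found.
        return match.group(1) + new_url + match.group(3)

    text = ATTR_RE.sub(fix, text)
    return text.encode("latin-1"), count
```

Returning the count alongside the bytes is what lets each pass report its own rewrite total, as in the table above.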

3. make audit — proving it (scripts/audit_links.py)

A browser-accurate link check: every link resolved relative to its page, case-sensitively (like a Linux filesystem), against the files actually present. In-page #anchors are validated too.
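A case-sensitive resolution check of this kind can be sketched as below. The function name, the category labels, and the use of an in-memory set of file paths are assumptions for illustration; the real auditor in scripts/audit_links.py may be organised differently and also validates the #anchor targets, which this sketch does not.

```python
import posixpath
from typing import Set
from urllib.parse import unquote, urlsplit


def classify_link(page: str, href: str, files: Set[str]) -> str:
    """Classify one href found on `page` against the set of files in dist/."""
    parts = urlsplit(href)
    if parts.scheme == "mailto" or (not parts.scheme and not parts.netloc and not parts.path):
        return "anchor-or-mailto"
    if parts.scheme or parts.netloc:
        return "external"
    # Resolve relative to the page's own directory, as a browser would.
    target = posixpath.normpath(
        posixpath.join(posixpath.dirname(page), unquote(parts.path))
    )
    if target in files:
        return "ok"  # exact, case-sensitive match
    if target.lower() in {f.lower() for f in files}:
        return "case-mismatch"  # would work on macOS, break on Linux
    return "broken"
```

Comparing against a set of known paths rather than calling the filesystem keeps the result the same on every platform, including case-insensitive ones.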

Final audit of dist/:

| Result | Count |
|--------|-------|
| Internal links that resolve | 3,978 |
| External links (left as-is, not verified) | 1,363 |
| In-page anchors / mailto | 1,699 |
| Case-mismatches (break on Linux, work on macOS) | 0 |
| Real broken internal links | 0 |
| Malformed-source fragments (not real links) | 1 |
| Broken in-page anchors (preserved from source) | 43 |

make audit exits non-zero only on real broken internal links or case-mismatches, so it can gate a build.


The "link has gone quiet" page

Some links the 1998 site pointed at have moved or died. Rather than leave them as silent dead-ends, known-dead destinations are redirected to a generated link-gone.html that offers a still-working substitute where one exists:

  • jazzisdead.com → the Wikipedia article "Jazz Is Dead (band)" (the original domain now hosts an unrelated label)
  • www.halcyon.com/wardk/… → no known replacement (a retired personal page)

This list lives in ALT_LINKS in scripts/build_site.py and is easy to extend.

A bug worth remembering: the alt-links page was first named gone.html — which silently overwrote the real "He's Gone" song page (gone.html). It's now link-gone.html, and the build has a hard guard that aborts if a generated filename ever collides with a real mirrored page.
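A guard like the one described can be very small. The sketch below is an assumption about its shape, not the code in scripts/build_site.py; the function names are hypothetical, and filenames are compared exactly (a case-insensitive comparison would be stricter).

```python
from typing import Iterable, List


def find_collisions(generated: Iterable[str], mirrored: Iterable[str]) -> List[str]:
    """Return generated filenames that also exist as real mirrored pages."""
    return sorted(set(generated) & set(mirrored))


def guard_generated_names(generated: Iterable[str], mirrored: Iterable[str]) -> None:
    """Abort the build rather than overwrite a mirrored page."""
    clashes = find_collisions(generated, mirrored)
    if clashes:
        raise SystemExit(
            "generated page would overwrite mirrored page(s): " + ", ".join(clashes)
        )
```

With this in place, naming the alternatives page gone.html would stop the build immediately, while link-gone.html passes.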


Known limitations (deliberately preserved)

These are defects in the original 1990s source, preserved as found rather than papered over:

  • 1 malformed-link fragment remains: ripple.html's (Chorus) link points at a chorus target that was never created, so there's nothing to repair it to. (Five other malformed links with clear intended targets are fixed in pass 0.)
  • 43 broken in-page anchors (e.g. biblio.html#goose) point at anchors the author linked to but never created. The 17 that were obvious typos are auto-repaired; the rest are left honest.
  • External links are not liveness-checked. ~1,360 of them, mostly to long-gone 1990s sites — not this project's content to fix.

Project layout

mirror/              # raw byte-exact archive copy — source of truth (gitignored;
                     #   not committed; recreate with `make mirror`)
dist/                # built, link-fixed site (gitignored; regenerate with `make dist`)
scripts/
  mirror.py          # the raw crawler  (make mirror / mirror-retry)
  build_site.py      # the cleanup build (make dist)
  safe_build.py      # the lyric-strip pass for public hosting (make safe)
  audit_links.py     # the link auditor  (make audit)
  release.sh         # tag a semver release (make release)
.github/workflows/   # CI, Pages deploy, release automation
Makefile             # all commands — run `make help`
.mirror_state/       # crawler resume state (gitignored)

Run make help for the full target list.


Hosting & releases

The site is hosted on GitHub Pages and deploys automatically:

  • Every merge to main runs CI (build + link audit) and, on success, publishes the site to Pages (.github/workflows/deploy-pages.yml). The deploy runs make safe, so the annotation-only site is what goes live. CI checks out mirror/ from the private mirror repo (via MIRROR_DEPLOY_KEY), so no archive.org crawl happens in CI.
  • Releases are semver-tagged. make release reads Conventional Commits since the last v* tag, picks the next version, and pushes a vX.Y.Z tag. That triggers release.yml, which builds the safe site, attaches a lyric-free dist.zip, and publishes a GitHub Release. Preview first with make release-dryrun.

Contributing

main is protected — all changes go through a pull request with passing CI. The cardinal rule: never hand-edit mirror/ or dist/ — express link and content fixes as code in scripts/build_site.py (HTML_FIXES, REDIRECTS, ALT_LINKS, anchor repair), so they're repeatable and reviewable.

See CONTRIBUTING.md for the workflow and AGENTS.md for the full conventions (also what AI agents follow).


Content source & credits

Content is The Annotated Grateful Dead Lyrics by David Dodd, originally hosted at UC Santa Cruz, recovered from the Internet Archive. This is a preservation effort; all original authorship and attribution remain with David Dodd and the contributors credited throughout the pages.

About

David Dodd's Incredible Work for Annotating Grateful Dead Lyrics
