Fix broken docs links and add CI link checker #116

Merged
CallumMcMahon merged 5 commits into main from fix/broken_links
Feb 11, 2026

@CallumMcMahon commented Feb 9, 2026

Summary

  • Fix broken relative links across all guide pages: CSV download links now point to GitHub blob URLs, notebook links point to converted HTML docs pages, and .md extensions are dropped from internal reference links
  • Add a link checker script (docs-site/scripts/check-links.py) that parses all HTML in the static build output and verifies internal links resolve to existing pages
  • Integrate the check into the deploy-docs CI workflow, running after build and before deploy
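A minimal sketch of the kind of check described in the summary, assuming a static build directory of `.html` files. The names `LinkCollector`, `resolves`, and `find_broken_links` are hypothetical and are not taken from `check-links.py`; the resolution rules (file, `<path>.html`, or `<path>/index.html`) are also an assumption about how the site serves pages.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from every tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def resolves(build_dir: Path, page: Path, link: str) -> bool:
    """Return True if an internal link points at a file in the build output."""
    path = urlparse(urldefrag(link).url).path  # drop fragment and query
    if not path:
        return True  # pure "#fragment" link to the same page
    if path.startswith("/"):
        target = build_dir / path.lstrip("/")  # site-absolute link
    else:
        target = page.parent / path  # page-relative link
    # Assumed serving rules: exact file, extensionless page, or directory index.
    candidates = [target, Path(f"{target}.html"), target / "index.html"]
    return any(c.is_file() for c in candidates)


def find_broken_links(build_dir: Path) -> list[tuple[str, str]]:
    """Return (page, link) pairs for internal links that do not resolve."""
    broken = []
    for page in sorted(build_dir.rglob("*.html")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8"))
        for link in sorted(set(collector.links)):  # deduplicate per page
            if urlparse(link).scheme or link.startswith("//"):
                continue  # external links are out of scope for this sketch
            if not resolves(build_dir, page, link):
                broken.append((page.relative_to(build_dir).as_posix(), link))
    return broken
```

In a CI step, a wrapper could call `find_broken_links` on the build output and exit non-zero when the returned list is non-empty, so the deploy job never runs on a broken build.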

Pages fixed

  • fuzzy-join-without-keys: company_info.csv, valuations.csv, notebook link
  • add-column-web-lookup: saas_products.csv
  • classify-dataframe-rows-llm: hn_jobs.csv
  • filter-dataframe-with-llm: hn_jobs.csv, reference/SCREEN.md
  • resolve-entities-python: case_01_crm_data.csv, notebook link

Why a custom script instead of lychee?

lychee has a parser bug where links appearing after <pre><code> blocks are silently dropped. Since every docs page has code examples, this makes lychee unreliable for this site.
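One way a custom checker could guard against the same class of bug is a small regression test confirming that links after a `<pre><code>` block are still extracted. The sketch below uses Python's standard-library `html.parser`; `HrefCollector`, `extract_hrefs`, and `SAMPLE_PAGE` are hypothetical names, and this is not necessarily how `check-links.py` is tested.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_hrefs(html: str) -> list[str]:
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# A page shaped like the docs guides: a link, a code example, then another link.
SAMPLE_PAGE = """
<p><a href="before.html">before</a></p>
<pre><code>df = pd.read_csv("hn_jobs.csv")</code></pre>
<p><a href="after.html">after</a></p>
"""

# Both links should be found, including the one after the code block.
assert extract_hrefs(SAMPLE_PAGE) == ["before.html", "after.html"]
```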

Test plan

  • python docs-site/scripts/check-links.py passes with 0 broken links across 29 pages
  • Verified the script correctly detects the original broken links when reverted
  • CI workflow runs successfully on this PR

🤖 Generated with Claude Code

CallumMcMahon and others added 3 commits February 9, 2026 11:04
CSV links now point to GitHub instead of serving raw files from the
docs site, and notebook links point to the converted HTML docs page.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same pattern as fuzzy-join-without-keys: CSV links now point to GitHub,
notebook links point to the converted HTML docs pages, and the .md
extension is dropped from internal reference links.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a Python script that parses all HTML files in the static build
output and verifies internal links resolve to existing pages. Runs
after build in the deploy-docs workflow to catch broken links before
deployment.

Note: lychee was evaluated but has a parser bug where links after
<pre><code> blocks are silently dropped, making it unreliable for
docs sites with code examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CallumMcMahon changed the title from "Fix broken links in fuzzy join guide" to "Fix broken docs links and add CI link checker" Feb 9, 2026
- Extract links from <meta> and <link> tags (catches og:image 404s)
- Check GitHub blob/main links by verifying the local file exists
- SKIPPED_URLS is now per-URL instead of per-domain, so new links
  to an already-skipped domain still get flagged
- CHECKED_DOMAINS remains domain-level for our own properties
- Deduplicate links per page

Known failure: everyrow.io/everyrow-og.png is 404 on every page.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
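The rules in this commit message can be sketched roughly as follows. `SKIPPED_URLS` and `CHECKED_DOMAINS` are the names used in the commit message, but their contents here are placeholders, and `classify`, `unique_links`, and `BLOB_PREFIX` are hypothetical helpers, not the script's actual code.

```python
from pathlib import Path
from urllib.parse import urlparse

# Placeholder values; the real lists live in check-links.py.
SKIPPED_URLS = {"https://example.com/known-flaky-page"}
CHECKED_DOMAINS = {"everyrow.io"}
BLOB_PREFIX = "https://github.com/futuresearch/everyrow-sdk/blob/main/"


def unique_links(links: list[str]) -> list[str]:
    """Deduplicate links on a page while keeping first-seen order."""
    return list(dict.fromkeys(links))


def classify(url: str, repo_root: Path) -> str:
    """Return 'skip', 'ok', 'broken', 'check', or 'unknown' for an absolute URL."""
    if url in SKIPPED_URLS:
        return "skip"  # exact URL match only, never the whole domain
    if url.startswith(BLOB_PREFIX):
        # A blob/main link is valid if the file exists in the local checkout.
        local = repo_root / url[len(BLOB_PREFIX):]
        return "ok" if local.is_file() else "broken"
    if urlparse(url).hostname in CHECKED_DOMAINS:
        return "check"  # our own property: the caller should request it
    return "unknown"  # new external link: flag it for review
```

Because the skip list matches whole URLs, a new link to a domain that already has one skipped URL falls through to `"unknown"` and gets flagged, which is the behaviour the commit message describes.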

- The dataset is a list of 246 SaaS and developer tools like Slack, Notion, Asana. Download [saas_products.csv](data/saas_products.csv) to follow along. We find the annual price of each product's lowest paid tier, which isn't available through any structured API; it requires visiting pricing pages that change frequently and present information in different formats.
+ The dataset is a list of 246 SaaS and developer tools like Slack, Notion, Asana. Download [saas_products.csv](https://github.com/futuresearch/everyrow-sdk/blob/main/docs/data/saas_products.csv) to follow along. We find the annual price of each product's lowest paid tier, which isn't available through any structured API; it requires visiting pricing pages that change frequently and present information in different formats.
Contributor

Since we use LFS to store CSVs, I don't see any data when I go to that URL. Should we link to the raw content instead, in this case https://media.githubusercontent.com/media/futuresearch/everyrow-sdk/refs/heads/main/docs/data/saas_products.csv?

Member Author

Ahh nice, I hadn't thought of that! let me change it over
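The rewrite suggested in this thread, from a GitHub blob URL to the LFS media URL, can be sketched as below. `lfs_media_url` is a hypothetical helper, and the sketch assumes the branch name contains no slashes (true for `main`).

```python
from urllib.parse import urlparse


def lfs_media_url(blob_url: str) -> str:
    """Map github.com/<owner>/<repo>/blob/<branch>/<path> to its LFS media URL."""
    parsed = urlparse(blob_url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<branch>/<path...>
    if parsed.hostname != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _, branch, *path = parts
    return (
        "https://media.githubusercontent.com/media/"
        f"{owner}/{repo}/refs/heads/{branch}/{'/'.join(path)}"
    )


blob = "https://github.com/futuresearch/everyrow-sdk/blob/main/docs/data/saas_products.csv"
print(lfs_media_url(blob))
```

For the `saas_products.csv` link in the diff, this produces the same `media.githubusercontent.com` URL proposed in the review comment.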

Contributor

@hnykda left a comment

Good catch

@CallumMcMahon merged commit fddae60 into main Feb 11, 2026
2 checks passed
@CallumMcMahon deleted the fix/broken_links branch February 11, 2026 10:36

3 participants