Link checker sometimes flaky #1056
Should we add a slight delay in between requests? If we do too many, we get a 429 like in your example :/
Random brainstorming; I don't know how easy or difficult these would be to implement in the current code:
If throttling works, that might already be it; but I don't know how site rate-limiting heuristics usually trigger.
About the
But the link is valid. Note: I can open a new issue (tomorrow), but I am reporting this in a hurry because I am out of time before an important appointment.
That link is valid but does not exist in the HTML since it's handled in JS. You want to skip those anchor checks with the `skip_anchor_prefixes` setting.
@Keats Thanks, but writing this in my `config.toml`:

```toml
# Skip anchor checking for external URLs that start with these prefixes
skip_anchor_prefixes = [
    # This link is valid but does not exist in the HTML since it's handled in JS
    #"https://gist.github.com/",
    "https://gist.github.com/necolas/1024797",
]
```
@maxild can you try with:
Place it just above the `[extra]` section.
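The snippet in this reply was not captured above; presumably it was the `[link_checker]` table header, which is the section `skip_anchor_prefixes` belongs to in Zola's `config.toml`. A sketch of the suggested placement, assuming that reading:

```toml
# Anchor checks for these URL prefixes are skipped
[link_checker]
skip_anchor_prefixes = [
    "https://gist.github.com/necolas/1024797",
]

# The key must sit under [link_checker], not loose above [extra]
[extra]
```

In TOML, a bare key placed before any `[section]` header lands in the top-level table, which is likely why the original attempt had no effect.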
It is there in the docs; it's just that TOML is not super obvious about sections...
@Keats yes, you are right, the docs are ok. I think in such cases we focus on the required keys and just forget about the section. But yes, the issue is definitely between the chair and the keyboard. :)
* Update configuration documentation
  - Attempt to split the configuration file into sections to make it more readable and avoid configuration mistakes (#1056).
  - Move translation instructions to the right part.
  - Add a bit more explanations to the extra section.
* Take into account @Keats's feedback
* Remove short notice about translation usage
  - An i18n page should be created to better explain it.
I just stumbled across this overeagerness of the link checker today. On reasonably busy blogs like https://github.com/rust-embedded/blog, with several hundred links into GitHub, the limit is reached within seconds. Also, the deduplication done in the link checker doesn't do much at all in most cases.

I've been thinking about ways to address that, but a throttling mechanism really doesn't help too much here because it will just make everything slow without providing any real benefit. I think instead there should be a way to restrict link checking to new content, e.g. based on file modification times. No need to check (and potentially flag) hundreds of years-old links over and over again in CI when adding new content...

Also, the number of parallel threads should really be configurable instead of being a hardcoded number...
Content disappears all the time :/ I would say it's more likely that some link died in older content than in recent content. I'm wondering if we can group the links to check per domain, as mentioned at the beginning of the issue, and have some light (and maybe configurable) throttling per domain. All the links from the same domain would get sent to the same thread. Maybe reducing the number of threads would do the trick without having to do any throttling. The rust-embedded blog would be a good test case for that.
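The per-domain grouping described above can be sketched as follows (hypothetical helper names, not Zola's actual code):

```rust
use std::collections::HashMap;

// Extract the host portion of a URL with naive string slicing.
// A real implementation would use a proper URL parser.
fn domain_of(link: &str) -> String {
    let after_scheme = link.split("://").nth(1).unwrap_or(link);
    after_scheme.split('/').next().unwrap_or("").to_string()
}

// Group links by domain so that all links for a host can be handed
// to a single worker thread, keeping the per-host request rate low.
fn group_by_domain(links: &[&str]) -> HashMap<String, Vec<String>> {
    let mut groups: HashMap<String, Vec<String>> = HashMap::new();
    for link in links {
        groups.entry(domain_of(link)).or_default().push(link.to_string());
    }
    groups
}

fn main() {
    let links = [
        "https://github.com/getzola/zola",
        "https://github.com/rust-embedded/blog",
        "https://docs.rs/url",
    ];
    for (domain, ls) in group_by_domain(&links) {
        println!("{}: {} link(s)", domain, ls.len());
    }
}
```

With this shape, the rust-embedded blog's several hundred GitHub links would collapse into one serialized queue instead of 32 threads hammering the same host.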
Sure. However, I'm more worried about defunct links in new content than about old link targets disappearing. For use in CI we certainly wouldn't want to put the onus on contributors to fix broken links in old content, would we? I could imagine having a monthly complete check to flag broken links, and CI only checking the links in files changed in the last two weeks.
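A sketch of that modification-time filter, assuming a helper that compares a file's mtime against a window (the name is hypothetical, not part of Zola):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

// Hypothetical filter: only link-check files modified within `window`,
// so CI skips old content that was already checked in earlier runs.
fn recently_modified(path: &Path, window: Duration) -> std::io::Result<bool> {
    let modified = fs::metadata(path)?.modified()?;
    let elapsed = SystemTime::now()
        .duration_since(modified)
        .unwrap_or_default();
    Ok(elapsed <= window)
}

fn main() -> std::io::Result<()> {
    // A file written just now counts as recent for a two-week window.
    let path = std::env::temp_dir().join("zola_mtime_demo.md");
    fs::write(&path, "# new post")?;
    assert!(recently_modified(&path, Duration::from_secs(14 * 24 * 3600))?);
    Ok(())
}
```

Note that mtime is fragile in CI (fresh checkouts reset timestamps), so a real implementation would more likely ask the VCS which files changed.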
I agree, maybe a
👋 Hitting some 429s here too. From what I see, there are at most 32 threads created to fetch the links simultaneously:

zola/components/site/src/link_checking.rs, lines 122 to 128 in a9afb07
What about making this configurable as a CLI flag?

Edit: I have about 500 external links on my website, with 150 of them being github.com links. I have to tune it down all the way to 2 threads; starting from 4, I get 429 errors.
I stumbled upon Lychee, which is a link checker written in Rust. It allows setting a GitHub token to avoid being rate-limited when checking GitHub links: https://github.com/lycheeverse/lychee#github-token. Not sure if this is a feature that would make it into Zola, as it might be too niche.
This definitely needs to be fixed. Either group the links by domain and add some delay in between, or just tune down the number of threads, or both.
Okay! I shall try to come up with a PR |
* link_checking: prevent rate-limiting
  Fix for #1056.
  - assign all links for a domain to the same thread
  - reduce number of threads from 32 to 8
  - add sleep between HTTP calls
* Add get_link_domain(), use for loops
* Do not sleep after last link for domain
* Avoid quadratic complexity
* remove prints
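A minimal sketch of the merged scheme's per-domain worker (hypothetical names; the actual code lives in `components/site/src/link_checking.rs`): one worker handles all links for a domain and sleeps between requests, but not after the last one.

```rust
use std::thread;
use std::time::Duration;

// Hypothetical worker: checks every link for one domain sequentially,
// pausing between requests to the same host to stay under rate limits.
// Returns the number of links processed.
fn check_domain(domain: &str, links: Vec<String>, delay: Duration) -> usize {
    let mut checked = 0;
    for (i, link) in links.iter().enumerate() {
        // A real implementation would issue an HTTP request here.
        println!("checking {} ({})", link, domain);
        checked += 1;
        // Do not sleep after the last link for this domain.
        if i + 1 < links.len() {
            thread::sleep(delay);
        }
    }
    checked
}

fn main() {
    // One thread per domain; github.com's links are serialized with a delay.
    let handle = thread::spawn(|| {
        check_domain(
            "github.com",
            vec![
                "https://github.com/getzola/zola".into(),
                "https://github.com/rust-embedded/blog".into(),
            ],
            Duration::from_millis(50),
        )
    });
    assert_eq!(handle.join().unwrap(), 2);
}
```

Skipping the sleep after the final link matters because a domain with a single link would otherwise pay the delay for nothing, which is where most of the slowdown would come from on long-tail domains.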
Bug Report
Environment
Zola version: 0.11.0
Current Behavior
Hi, when running the link checker on my site, zola reports spurious network errors, although opening the same site in Firefox works fine. As a subsequent error, if I try to re-run the checker, I sometimes get blocked by other hosts.
For example, first run:
Running again:
In contrast to them being labeled "temporary", once I encounter `os error 11002` messages I cannot get `zola check` to pass again.
Expected Behavior
It would be nice if `zola` would not fail on these `os error 11002` messages.
Steps to reproduce
As mentioned, this does not reliably reproduce. In any case, this branch and `zola check` might do the trick.