Link checker sometimes flaky #1056

Closed
ralfbiedert opened this issue Jun 8, 2020 · 16 comments
Labels: bug, done in pr (Already done in a PR)


@ralfbiedert

Bug Report

Environment

Zola version: 0.11.0

Current Behavior

Hi, when running the link checker on my site, zola reports spurious network errors, although opening the same URLs in Firefox works fine. On top of that, if I try to re-run the checker, I then sometimes get blocked by other hosts.

For example, first run:

D:\Development\Source\cheats.rs>bash ./deploy.sh --live
Checking site...
Checking 0 internal link(s) with an anchor.
Checking 256 external link(s).
> Checked 256 external link(s): 3 error(s) found.
Failed to check the site
Error: Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://rustup.rs/: error sending request for url (https://rustup.rs/): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://www.jetbrains.com/idea/: error sending request for url (https://www.jetbrains.com/idea/): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://www.jetbrains.com/clion/: error sending request for url (https://www.jetbrains.com/clion/): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
error with 'zola'

Running again:

D:\Development\Source\cheats.rs>bash ./deploy.sh --live
Checking site...
Checking 0 internal link(s) with an anchor.
Checking 256 external link(s).
> Checked 256 external link(s): 3 error(s) found.
Failed to check the site
Error: Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://github.com/bheisler/criterion.rs: Client error status code (429 Too Many Requests) received
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://github.com/async-rs/async-std: Client error status code (429 Too Many Requests) received
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://github.com/rust-lang/rust/blob/master/RELEASES.md: Client error status code (429 Too Many Requests) received
error with 'zola'

Despite these errors being labeled "temporary", once I encounter os error 11002 I can

  • semi-reliably reproduce it via zola check,
  • yet still access the same sites via Firefox.

Expected Behavior

It would be nice if

  • zola did not fail outright on these os error 11002 occurrences;
  • if they truly are "temporary", it could perhaps retry internally first.

Steps to reproduce

As mentioned, this does not reproduce reliably. In any case, this branch and zola check might do the trick.

@Keats (Collaborator) commented Jun 18, 2020

Should we add a slight delay in between requests? If we send too many, we get a 429 like in your example :/

@ralfbiedert (Author)

Random brainstorming; I don't know how easy or difficult these are to implement in the current code:

  • delay as you described; if doable, throttle per domain but not between different domains
  • retry failed URLs after some time (e.g., 10s), and print on the console that URLs are being retried so the user knows why it's taking extra time
  • cache remote link checks and only re-check after Cache-Control or X minutes/hours/days if the last check succeeded (local links I'd still check on every Zola generation)
  • randomize the link check order, and ignore failed URLs if they are clearly identifiable as 429-like errors

If throttling works, that might already be it; but I don't know how site rate-limiting heuristics usually trigger.
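
For illustration, a rough sketch of the per-domain throttle plus retry ideas above; check_url, the delay values, and the grouping are all hypothetical, not Zola's actual code:

use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

// Hypothetical checker; Zola's real signature and error type differ.
fn check_url(url: &str) -> Result<(), String> {
    // ... issue the HTTP request and map failures to Err ...
    Ok(())
}

fn check_with_throttle_and_retry(links: &[String]) {
    // Group links by domain so throttling applies within a domain,
    // not between different domains.
    let mut by_domain: HashMap<&str, Vec<&str>> = HashMap::new();
    for link in links {
        let domain = link.split('/').nth(2).unwrap_or("");
        by_domain.entry(domain).or_default().push(link.as_str());
    }

    for (_domain, domain_links) in &by_domain {
        for link in domain_links {
            let mut outcome = check_url(link);
            if outcome.is_err() {
                // Retry once after a pause, telling the user why it's slow.
                println!("Retrying {} in 10s...", link);
                sleep(Duration::from_secs(10));
                outcome = check_url(link);
            }
            if let Err(e) = outcome {
                println!("Dead link {}: {}", link, e);
            }
            // Throttle between requests to the same domain.
            sleep(Duration::from_millis(500));
        }
    }
}

Running the per-domain groups on separate threads (rather than sequentially as here) would keep the total runtime reasonable.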

@maxild commented Jul 8, 2020

About the check command: I have an example run where this error is reported

Failed to check the site
Error: Dead link in /Users/maxfire/projects/web/maxild.github.io/content/component-based-css.md to https://gist.github.com/necolas/1024797#file-style-css-L196-L222: Anchor `#file-style-css-L196-L222` not found on page

But the link is valid.

Note: I can open a new issue (tomorrow), but I'm reporting this in a hurry because I'm out of time before an important appointment.

@Keats (Collaborator) commented Jul 8, 2020

That link is valid, but the anchor does not exist in the HTML since it's handled in JS. You want to skip those anchor checks with the skip_anchor_prefixes option: https://www.getzola.org/documentation/getting-started/configuration/

@maxild commented Jul 9, 2020

@Keats Thanks, but after writing this in config.toml:

# Skip anchor checking for external URLs that start with these prefixes
skip_anchor_prefixes = [
    # This link is valid but does not exist in the HTML since it's handled in JS
    #"https://gist.github.com/",
    "https://gist.github.com/necolas/1024797",
]

the check command still shows the same error?

@uggla (Contributor) commented Aug 11, 2020

@maxild can you try with:

[link_checker]
skip_anchor_prefixes = [
    "https://gist.github.com/necolas/1024797",
]

Place it just above the [extra] section.
It seems to work on my side. I think there is a pitfall in the documentation: the [link_checker] section header is missing from the example.
If you confirm it is ok on your side, I'll update the documentation.

@Keats (Collaborator) commented Aug 12, 2020

It is there in the docs, it's just that TOML is not super obvious about sections...

@uggla (Contributor) commented Aug 12, 2020

@Keats yes, you are right, the docs are ok. I think in such cases we focus on the required keys and just forget about the section. But yes, the issue is definitely between the chair and the keyboard. :)

uggla added a commit to uggla/zola that referenced this issue Aug 12, 2020
- Attempt to split the configuration file into sections to make it more readable and
  avoid configuration mistakes (getzola#1056).
- Move translation instructions to the right part.
- Add a bit more explanations to the extra section.
uggla mentioned this issue Aug 12, 2020
Keats pushed a commit that referenced this issue Aug 16, 2020
* Update configuration documentation

- Attempt to split the configuration file into sections to make it more readable and
  avoid configuration mistakes (#1056).
- Move translation instructions to the right part.
- Add a bit more explanations to the extra section.

* Take into account @Keats' feedback

* Remove short notice about translation usage

- An i18n page should be created to better explain it.
@therealprof (Contributor)

I just stumbled across this overeagerness of the link checker today. On reasonably busy blogs like https://github.com/rust-embedded/blog, with several hundred links into GitHub, the limit is reached within seconds. Also, the deduplication done in the link checker doesn't do much at all in most cases, since the HashMap of results is only updated when a result comes back from the server; if the requests end up in the same "batch", they'll be fired off in parallel anyway.

I've been thinking about ways to address that, but a throttling mechanism really doesn't help much here because it will just make everything slow without providing any real benefit. I think instead there should be a way to restrict link checking to new content, e.g. based on file modification times. No need to check (and potentially flag) hundreds of years-old links over and over again in CI when adding new content...

Also the number of parallel threads should really be configurable instead of being a hardcoded number...
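
As for the deduplication point: collecting unique URLs before dispatching anything would avoid the same-batch race entirely. A minimal sketch, with illustrative names:

use std::collections::HashSet;

// Collect unique URLs before dispatching any requests, so identical
// links can never end up in the same parallel "batch" twice.
fn dedup_links(all_links: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    all_links.into_iter().filter(|l| seen.insert(l.clone())).collect()
}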

@Keats (Collaborator) commented Aug 17, 2020

> No need to check (and potentially flag) hundreds of years-old links over and over again in CI when adding new content...

Content disappears all the time :/ I would say it's more likely that some link died in older content than recent content.

I'm wondering if we can group the links to check per domain, as mentioned at the beginning of the issue, and have some light (and maybe configurable) throttling per domain. All the links from the same domain would get sent to the same thread. Maybe reducing the number of threads would do the trick without having to do any throttling. The rust-embedded blog would be a good test case for that.

@therealprof (Contributor)

> Content disappears all the time :/ I would say it's more likely that some link died in older content than recent content.

Sure. However, I'm more worried about defunct links in new content than about old link targets disappearing. For use in CI we certainly wouldn't want to put the onus on contributors to fix broken links in old content, would we? I could imagine having a monthly complete check to flag broken links, with CI only checking the links in files changed in the last 2 weeks.

@Keats (Collaborator) commented Aug 17, 2020

I agree, maybe a url:last_check_timestamp dot file loaded from the root of the site?
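
A minimal sketch of what such a cache could look like, assuming a hypothetical .zola_link_cache file with one "<unix_timestamp> <url>" line per successfully checked link; the file name, format, and two-week window are all made up, not an agreed design:

use std::collections::HashMap;
use std::fs;
use std::time::{SystemTime, UNIX_EPOCH};

const CACHE_FILE: &str = ".zola_link_cache"; // hypothetical name
const MAX_AGE_SECS: u64 = 14 * 24 * 60 * 60; // e.g. re-check after two weeks

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

// Load "<unix_timestamp> <url>" lines into url -> last_check_timestamp.
fn load_cache() -> HashMap<String, u64> {
    fs::read_to_string(CACHE_FILE)
        .unwrap_or_default()
        .lines()
        .filter_map(|line| {
            let (ts, url) = line.split_once(' ')?;
            Some((url.to_string(), ts.parse::<u64>().ok()?))
        })
        .collect()
}

// A link only needs re-checking if it is missing or its entry is stale.
fn needs_check(url: &str, cache: &HashMap<String, u64>) -> bool {
    cache.get(url).map_or(true, |&ts| now_secs() - ts > MAX_AGE_SECS)
}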

Keats added a commit that referenced this issue Sep 1, 2020
* mention code block output change

* Update snap

* Update themes gallery (#1082)

Co-authored-by: GitHub Action <action@github.com>

* Deployment guide for Vercel

* Change wording a bit

* Update themes gallery (#1122)

Co-authored-by: GitHub Action <action@github.com>

* Add feed autodiscovery documentation (#1123)

* Add feed autodiscovery documentation

* Fix link in template

* Docs/configuration update (#1126)

* Update configuration documentation

- Attempt to split the configuration file into sections to make it more readable and
  avoid configuration mistakes (#1056).
- Move translation instructions to the right part.
- Add a bit more explanations to the extra section.

* Take into account @Keats' feedback

* Remove short notice about translation usage

- An i18n page should be created to better explain it.

* add fix for (#1135) Taxonomies with identical slugs now get merged (#1136)

* add test and implementation for reverse pagination

* incorporate review changes

Co-authored-by: Michael Plotke <bdjnks@gmail.com>
Co-authored-by: Vincent Prouillet <balthek@gmail.com>
Co-authored-by: GitHub Action <action@github.com>
Co-authored-by: Samyak Bakliwal <w3bcode@gmail.com>
Co-authored-by: René Ribaud <uggla@free.fr>
@angristan (Contributor) commented Mar 28, 2021

👋 Hitting some 429 Too Many Requests since I have many links to github.com on my website.

From what I see, there are at most 32 threads created to fetch the links simultaneously:

// create thread pool with lots of threads so we can fetch
// (almost) all pages simultaneously
let threads = std::cmp::min(all_links.len(), 32);
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(threads)
    .build()
    .map_err(|e| Error { kind: ErrorKind::Msg(e.to_string()), source: None })?;

What about making this configurable as a CLI flag? I run zola check in a CI pipeline so I don't mind if it takes longer.

Edit: I have about 500 external links on my website, 150 of them being github.com links. I had to tune it down all the way to 2 threads; starting from 4, I get 429 errors.
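
A sketch of what that knob could look like, with the limit coming from a hypothetical CLI flag or config key; the helper is illustrative, not Zola's actual code:

use rayon::ThreadPoolBuilder;

// Hypothetical: take the thread limit from a flag/config value,
// falling back to the currently hardcoded maximum of 32.
fn build_check_pool(num_links: usize, configured: Option<usize>) -> rayon::ThreadPool {
    let max_threads = configured.unwrap_or(32);
    let threads = std::cmp::min(num_links, max_threads).max(1);
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .expect("failed to build link checker thread pool")
}

With something like build_check_pool(all_links.len(), Some(2)), the 2-thread run above would not require patching the source.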

@angristan (Contributor) commented Mar 28, 2021

I stumbled upon Lychee, which is a link checker written in Rust. It allows setting a GitHub token to avoid being rate-limited when checking GitHub links: https://github.com/lycheeverse/lychee#github-token. Not sure if this is a feature that would make it into Zola, as it might be too niche.

@Keats (Collaborator) commented Mar 29, 2021

This definitely needs to be fixed. Either group the links by domain and add some delay in between, or just tune down the number of threads, or both.

@angristan (Contributor)

Okay! I shall try to come up with a PR

angristan added a commit to angristan/zola that referenced this issue Mar 30, 2021
Fix for getzola#1056.

- assign all links for a domain to the same thread
- reduce number of threads from 32 to 8
- add sleep between HTTP calls
Keats pushed a commit that referenced this issue Apr 21, 2021
* link_checking: prevent rate-limiting

Fix for #1056.

- assign all links for a domain to the same thread
- reduce number of threads from 32 to 8
- add sleep between HTTP calls

* Add get_link_domain(), use for loops

* Do not sleep after last link for domain

* Avoid quadratic complexity

* remove prints
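
In outline, the merged approach looks roughly like this; the delay value and the placeholder check are illustrative, see the PR for the real code:

use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

// Crude domain extraction from "scheme://domain/..."; the real code
// parses the URL properly.
fn get_link_domain(link: &str) -> &str {
    link.split('/').nth(2).unwrap_or("")
}

fn check_all(links: &[String]) {
    // Assign all links for a domain to the same thread.
    let mut by_domain: HashMap<&str, Vec<&str>> = HashMap::new();
    for link in links {
        by_domain.entry(get_link_domain(link)).or_default().push(link.as_str());
    }

    // Reduced pool: 8 threads instead of 32.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(std::cmp::min(by_domain.len().max(1), 8))
        .build()
        .unwrap();

    pool.scope(|s| {
        for (_domain, domain_links) in &by_domain {
            s.spawn(move |_| {
                for (i, link) in domain_links.iter().enumerate() {
                    println!("checking {}", link); // placeholder for the real check
                    if i + 1 < domain_links.len() {
                        // Sleep between HTTP calls, but not after the
                        // last link for this domain.
                        sleep(Duration::from_millis(500));
                    }
                }
            });
        }
    });
}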
Keats added the done in pr (Already done in a PR) label and removed the help wanted label on Apr 21, 2021
Keats closed this as completed on Jul 19, 2021