Link checker sometimes flaky #1056

Closed
ralfbiedert opened this issue Jun 8, 2020 · 16 comments
Labels: bug, done in pr (Already done in a PR)


@ralfbiedert

Bug Report

Environment

Zola version: 0.11.0

Current Behavior

Hi, when running the link checker on my site, zola reports spurious network errors, although opening the same URLs in Firefox works fine. On top of that, if I try to re-run the checker, I then sometimes get blocked by other hosts.

For example, first run:

D:\Development\Source\cheats.rs>bash ./deploy.sh --live
Checking site...
Checking 0 internal link(s) with an anchor.
Checking 256 external link(s).
> Checked 256 external link(s): 3 error(s) found.
Failed to check the site
Error: Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://rustup.rs/: error sending request for url (https://rustup.rs/): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://www.jetbrains.com/idea/: error sending request for url (https://www.jetbrains.com/idea/): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://www.jetbrains.com/clion/: error sending request for url (https://www.jetbrains.com/clion/): error trying to connect: dns error: This is usually a temporary error during hostname resolution and means that the local server did not receive a response from an authoritative server. (os error 11002)
error with 'zola'

Running again:

D:\Development\Source\cheats.rs>bash ./deploy.sh --live
Checking site...
Checking 0 internal link(s) with an anchor.
Checking 256 external link(s).
> Checked 256 external link(s): 3 error(s) found.
Failed to check the site
Error: Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://github.com/bheisler/criterion.rs: Client error status code (429 Too Many Requests) received
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://github.com/async-rs/async-std: Client error status code (429 Too Many Requests) received
Dead link in D:/Development\Source\cheats.rs\content\_index.md to https://github.com/rust-lang/rust/blob/master/RELEASES.md: Client error status code (429 Too Many Requests) received
error with 'zola'

Despite these errors being labeled "temporary", once I encounter os error 11002 I can

  • semi-reliably reproduce it via zola check,
  • yet still access the same sites via Firefox.

Expected Behavior

It would be nice if

  • zola did not fail outright on these os error 11002 occurrences;
  • if they truly are "temporary", it could perhaps retry internally first.

Steps to reproduce

As mentioned, this does not reproduce reliably. In any case, this branch and zola check might do the trick.

@Keats (Collaborator) commented Jun 18, 2020

Should we add a slight delay in between requests? If we send too many, we get a 429 like in your example :/

@ralfbiedert (Author)

Random brainstorming; I don't know how easy or difficult these are to implement in the current code:

  • delay as you described; if doable, throttle per domain but not between different domains
  • retry failed URLs after some time (e.g., 10s), and print on the console that URLs are being retried so the user knows why it's taking extra time
  • cache remote link checks and only re-check after Cache-Control or X minutes/hours/days if the last check succeeded (local links I'd still check on every Zola generation)
  • randomize the link check order, and ignore failed URLs if they are clearly identifiable as 429-like errors

If throttling works, that might already be it; but I don't know how site rate-limiting heuristics usually trigger.
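
For illustration, a rough sketch of the per-domain throttle plus retry ideas above; check_url, the delay values, and the grouping are all hypothetical, not Zola's actual code:

use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

// Hypothetical checker; Zola's real signature and error type differ.
fn check_url(url: &str) -> Result<(), String> {
    // ... issue the HTTP request and map failures to Err ...
    Ok(())
}

fn check_with_throttle_and_retry(links: &[String]) {
    // Group links by domain so throttling applies within a domain,
    // not between different domains.
    let mut by_domain: HashMap<&str, Vec<&str>> = HashMap::new();
    for link in links {
        let domain = link.split('/').nth(2).unwrap_or("");
        by_domain.entry(domain).or_default().push(link.as_str());
    }

    for (_domain, domain_links) in &by_domain {
        for link in domain_links {
            let mut outcome = check_url(link);
            if outcome.is_err() {
                // Retry once after a pause, telling the user why it's slow.
                println!("Retrying {} in 10s...", link);
                sleep(Duration::from_secs(10));
                outcome = check_url(link);
            }
            if let Err(e) = outcome {
                println!("Dead link {}: {}", link, e);
            }
            // Throttle between requests to the same domain.
            sleep(Duration::from_millis(500));
        }
    }
}

Running the per-domain groups on separate threads (rather than sequentially as here) would keep the total runtime reasonable.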

@maxild commented Jul 8, 2020

About the check command: I have an example run where this error is reported

Failed to check the site
Error: Dead link in /Users/maxfire/projects/web/maxild.github.io/content/component-based-css.md to https://gist.github.com/necolas/1024797#file-style-css-L196-L222: Anchor `#file-style-css-L196-L222` not found on page

But the link is valid.

Note: I can open a new issue (tomorrow), but I'm reporting this in a hurry because I'm out of time before an important appointment.

@Keats (Collaborator) commented Jul 8, 2020

That link is valid, but the anchor does not exist in the HTML since it's handled in JS. You want to skip those anchor checks with the skip_anchor_prefixes option: https://www.getzola.org/documentation/getting-started/configuration/

@maxild commented Jul 9, 2020

@Keats Thanks, but after writing this in config.toml:

# Skip anchor checking for external URLs that start with these prefixes
skip_anchor_prefixes = [
    # This link is valid but does not exist in the HTML since it's handled in JS
    #"https://gist.github.com/",
    "https://gist.github.com/necolas/1024797",
]

the check command still shows the same error?

@uggla (Contributor) commented Aug 11, 2020

@maxild can you try with:

[link_checker]
skip_anchor_prefixes = [
    "https://gist.github.com/necolas/1024797",
]

Place it just above the [extra] section.
It seems to work on my side. I think there is a pitfall in the documentation: the [link_checker] section header is missing from the example.
If you confirm it is ok on your side, I'll update the documentation.

@Keats (Collaborator) commented Aug 12, 2020

It is there in the docs, it's just that TOML is not super obvious about sections...

@uggla (Contributor) commented Aug 12, 2020

@Keats yes, you are right, the docs are ok. I think in such cases we focus on the required keys and just forget about the section. But yes, the issue is definitely between the chair and the keyboard. :)

uggla added a commit to uggla/zola that referenced this issue Aug 12, 2020
- Attempt to split the configuration file into sections to make it more readable and
  avoid configuration mistakes (getzola#1056).
- Move translation instructions to the right part.
- Add a bit more explanations to the extra section.
uggla mentioned this issue Aug 12, 2020
Keats pushed a commit that referenced this issue Aug 16, 2020
* Update configuration documentation

- Attempt to split the configuration file into sections to make it more readable and
  avoid configuration mistakes (#1056).
- Move translation instructions to the right part.
- Add a bit more explanations to the extra section.

* Take into account @Keats' feedback

* Remove short notice about translation usage

- An i18n page should be created to better explain it.
@therealprof (Contributor)

I just stumbled across this overeagerness of the link checker today. On reasonably busy blogs like https://github.com/rust-embedded/blog, with several hundred links into GitHub, the limit is reached within seconds. Also, the deduplication done in the link checker doesn't do much at all in most cases, since the HashMap of results is only updated when a result comes back from the server; if the requests end up in the same "batch", they'll be fired off in parallel anyway.

I've been thinking about ways to address that, but a throttling mechanism really doesn't help much here because it will just make everything slow without providing any real benefit. I think instead there should be a way to restrict link checking to new content, e.g. based on file modification times. No need to check (and potentially flag) hundreds of years-old links over and over again in CI when adding new content...

Also the number of parallel threads should really be configurable instead of being a hardcoded number...
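
As for the deduplication point: collecting unique URLs before dispatching anything would avoid the same-batch race entirely. A minimal sketch, with illustrative names:

use std::collections::HashSet;

// Collect unique URLs before dispatching any requests, so identical
// links can never end up in the same parallel "batch" twice.
fn dedup_links(all_links: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    all_links.into_iter().filter(|l| seen.insert(l.clone())).collect()
}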

@Keats (Collaborator) commented Aug 17, 2020

> No need to check (and potentially flag) hundreds of years-old links over and over again in CI when adding new content...

Content disappears all the time :/ I would say it's more likely that some link died in older content than recent content.

I'm wondering if we can group the links to check per domain, as mentioned at the beginning of the issue, and have some light (and maybe configurable) throttling per domain. All the links from the same domain would get sent to the same thread. Maybe reducing the number of threads would do the trick without having to do any throttling. The rust-embedded blog would be a good test case for that.

@therealprof (Contributor)

> Content disappears all the time :/ I would say it's more likely that some link died in older content than recent content.

Sure. However, I'm more worried about defunct links in new content than about old link targets disappearing. For use in CI we certainly wouldn't want to put the onus on contributors to fix broken links in old content, would we? I could imagine having a monthly complete check to flag broken links, with CI only checking the links in files changed in the last 2 weeks.

@Keats (Collaborator) commented Aug 17, 2020

I agree, maybe a url:last_check_timestamp dot file loaded from the root of the site?
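
A minimal sketch of what such a cache could look like, assuming a hypothetical .zola_link_cache file with one "<unix_timestamp> <url>" line per successfully checked link; the file name, format, and two-week window are all made up, not an agreed design:

use std::collections::HashMap;
use std::fs;
use std::time::{SystemTime, UNIX_EPOCH};

const CACHE_FILE: &str = ".zola_link_cache"; // hypothetical name
const MAX_AGE_SECS: u64 = 14 * 24 * 60 * 60; // e.g. re-check after two weeks

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

// Load "<unix_timestamp> <url>" lines into url -> last_check_timestamp.
fn load_cache() -> HashMap<String, u64> {
    fs::read_to_string(CACHE_FILE)
        .unwrap_or_default()
        .lines()
        .filter_map(|line| {
            let (ts, url) = line.split_once(' ')?;
            Some((url.to_string(), ts.parse::<u64>().ok()?))
        })
        .collect()
}

// A link only needs re-checking if it is missing or its entry is stale.
fn needs_check(url: &str, cache: &HashMap<String, u64>) -> bool {
    cache.get(url).map_or(true, |&ts| now_secs() - ts > MAX_AGE_SECS)
}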

Keats added a commit that referenced this issue Sep 1, 2020
* mention code block output change

* Update snap

* Update themes gallery (#1082)

Co-authored-by: GitHub Action <action@github.com>

* Deployment guide for Vercel

* Change wording a bit

* Update themes gallery (#1122)

Co-authored-by: GitHub Action <action@github.com>

* Add feed autodiscovery documentation (#1123)

* Add feed autodiscovery documentation

* Fix link in template

* Docs/configuration update (#1126)

* Update configuration documentation

- Attempt to split the configuration file into sections to make it more readable and
  avoid configuration mistakes (#1056).
- Move translation instructions to the right part.
- Add a bit more explanations to the extra section.

* Take into account @Keats' feedback

* Remove short notice about translation usage

- An i18n page should be created to better explain it.

* add fix for (#1135) Taxonomies with identical slugs now get merged (#1136)

* add test and implementation for reverse pagination

* incorporate review changes

Co-authored-by: Michael Plotke <bdjnks@gmail.com>
Co-authored-by: Vincent Prouillet <balthek@gmail.com>
Co-authored-by: GitHub Action <action@github.com>
Co-authored-by: Samyak Bakliwal <w3bcode@gmail.com>
Co-authored-by: René Ribaud <uggla@free.fr>
@angristan (Contributor) commented Mar 28, 2021

👋 Hitting some 429 Too Many Requests since I have many links to github.com on my website.

From what I see, there are at most 32 threads created to fetch the links simultaneously:

// create thread pool with lots of threads so we can fetch
// (almost) all pages simultaneously
let threads = std::cmp::min(all_links.len(), 32);
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(threads)
    .build()
    .map_err(|e| Error { kind: ErrorKind::Msg(e.to_string()), source: None })?;

What about making this configurable as a CLI flag? I run zola check in a CI pipeline so I don't mind if it takes longer.

Edit: I have about 500 external links on my website, 150 of them being github.com links. I had to tune it down all the way to 2 threads; starting from 4, I get 429 errors.
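
A sketch of what that knob could look like, with the limit coming from a hypothetical CLI flag or config key; the helper is illustrative, not Zola's actual code:

use rayon::ThreadPoolBuilder;

// Hypothetical: take the thread limit from a flag/config value,
// falling back to the currently hardcoded maximum of 32.
fn build_check_pool(num_links: usize, configured: Option<usize>) -> rayon::ThreadPool {
    let max_threads = configured.unwrap_or(32);
    let threads = std::cmp::min(num_links, max_threads).max(1);
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .expect("failed to build link checker thread pool")
}

With something like build_check_pool(all_links.len(), Some(2)), the 2-thread run above would not require patching the source.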

@angristan (Contributor) commented Mar 28, 2021

I stumbled upon Lychee, which is a link checker written in Rust. It allows setting a GitHub token to avoid being rate-limited when checking GitHub links: https://github.com/lycheeverse/lychee#github-token. Not sure if this is a feature that would make it into Zola, as it might be too niche.

@Keats (Collaborator) commented Mar 29, 2021

This definitely needs to be fixed. Either group the links by domain and add some delay in between, or just tune down the number of threads, or both.

@angristan (Contributor)

Okay! I shall try to come up with a PR

angristan added a commit to angristan/zola that referenced this issue Mar 30, 2021
Fix for getzola#1056.

- assign all links for a domain to the same thread
- reduce number of threads from 32 to 8
- add sleep between HTTP calls
Keats pushed a commit that referenced this issue Apr 21, 2021
* link_checking: prevent rate-limiting

Fix for #1056.

- assign all links for a domain to the same thread
- reduce number of threads from 32 to 8
- add sleep between HTTP calls

* Add get_link_domain(), use for loops

* Do not sleep after last link for domain

* Avoid quadratic complexity

* remove prints
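
In outline, the merged approach looks roughly like this; the delay value and the placeholder check are illustrative, see the PR for the real code:

use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

// Crude domain extraction from "scheme://domain/..."; the real code
// parses the URL properly.
fn get_link_domain(link: &str) -> &str {
    link.split('/').nth(2).unwrap_or("")
}

fn check_all(links: &[String]) {
    // Assign all links for a domain to the same thread.
    let mut by_domain: HashMap<&str, Vec<&str>> = HashMap::new();
    for link in links {
        by_domain.entry(get_link_domain(link)).or_default().push(link.as_str());
    }

    // Reduced pool: 8 threads instead of 32.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(std::cmp::min(by_domain.len().max(1), 8))
        .build()
        .unwrap();

    pool.scope(|s| {
        for (_domain, domain_links) in &by_domain {
            s.spawn(move |_| {
                for (i, link) in domain_links.iter().enumerate() {
                    println!("checking {}", link); // placeholder for the real check
                    if i + 1 < domain_links.len() {
                        // Sleep between HTTP calls, but not after the
                        // last link for this domain.
                        sleep(Duration::from_millis(500));
                    }
                }
            });
        }
    });
}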
Keats added the done in pr (Already done in a PR) label and removed the help wanted label on Apr 21, 2021
Keats closed this as completed on Jul 19, 2021