-
Notifications
You must be signed in to change notification settings - Fork 985
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add heuristic checking for HTML anchors #1716
Conversation
components/utils/src/links.rs
Outdated
@@ -0,0 +1,16 @@ | |||
pub fn anchor_id_checks(anchor:&str) -> Vec<String> { | |||
vec![ | |||
format!(" id={}", anchor), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will regex be slower or faster here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, this is a lot of allocation, i'm guessing the regex would be faster? But hard to say without a benchmark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly faster and probably more maintainable. I just copied what was already there, but happy to refactor it out to use a regexp. This code only runs if other anchor checks fail, so any performance hit will only be on sites that are broken or have these anchors in.
The regexp crate is already a dependency, so nothing wrong with trying. Am guessing we'd need a few hundred links to make a benchmark informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's go with the regex approach then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, pushed. Performance testing results.
I compiled site with a two small files but one contain ~2k links to an XML anchor. I then tested various options, in all cases using cargo run
but taking the time reported at the end.
Comparing regexp to direct string matching
regexp:
--release: 1.3s
no --release: 39s
non-regexp:
--release: 1.0s
no --release: 19.4s
Conclusion: Regexp is slower esp for non-optimized builds.
Then comparing the non-regexp implementation with heuristic_link_check false or true
heuristic_link_check = false --release 0.5s
heuristic_link_check = false no --release 1.1s
Conclusion: The heuristic link check causes a 100% in build times (for release build) and a much bigger increase for non-optimized.
However, in this case zola produces lots of error messages and printouts can be slow. So, I have also tried fixing the broken links by adding an markdown generated anchor (with the same ~2k links to it). I also tried comparing this situation with and without the heuristic link checker.
--release, no failures: 0.5s
no -- release, no failures: 1.1s
heuristic_link_check = false --release no failures 0.5s
heuristic_link_check = false no --release no failures 1.1s
Conclusion 1: printing errors doesn't make much of a difference to the performance.
Conclusion 2: if the heuristic link checker is on, it does not make that much difference if most links are not broken. This would be predicted from the code.
So, I would say the regexp code is a bit neater, the multiple format! checking is a bit faster, so either implementation is fine. I'd probably got with the regexp cause neat trumps fast for me, but happy with what ever you prefer.
And while the heuristic_link_checker is significantly slows things down, you only pay for it if you are using XML anchors. As zola current refuses to build a site under these circumstances, it seems a reasonable price to pay.
components/utils/src/links.rs
Outdated
|
||
pub fn anchor_id_checks(anchor:&str) -> Regex { | ||
Regex::new( | ||
&format!(r#" (id|ID|name|NAME) *= *("|'){}("|')"#, anchor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're missing the ones where they have no quotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also make regex case-insensitive to handle Name, iD, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now done, hopefully, with any obscure side effects.
@@ -130,6 +130,9 @@ skip_anchor_prefixes = [ | |||
"https://caniuse.com/", | |||
] | |||
|
|||
# Check for links to anchors defined in HTML rather than markdown | |||
heuristic_link_check = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the end since it's only ran on local failures, it's probably fine to not need an option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your call, but it would give the option of disabling something that can produce failures. Switching the default around might be easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's already the workaround of creating the link in HTML directly
components/site/src/link_checking.rs
Outdated
|
||
!(page.has_anchor(anchor)|| | ||
site.config.link_checker.heuristic_link_check && | ||
page.has_anchor_id(anchor)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is regex created for each page? Would caching it make processing faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For each anchor and each page. I would guess that it would only make a minor difference in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think caching the regex wouldn't be as useful because the regex pattern contains the text of each anchor, so the likelihood any particular regex would be needed again is slim.
Unless the method was flipped around, so there was a single cached regex behind a mutex, with a pattern like r"(?i) (id|name) *= *"[^"]"? "
(case insesitively match all name or id attributes (I used only double quoting for simplicity)). Then there's only one regex, which can find all matches in a given page, then you can check those matches for the anchor in question.
I think there's some chance that would be faster than creating so many fresh regexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwcz I would agree that this might be faster. Then I thought regexs would be faster than exact string matching and was wrong. We don't have a good test realistic test case for any kind of real world benchmarking. My natural inclination would be to leave the simplest implementation till it's clear that this is a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure, +1 to keeping it simple at first.
For benchmarking, I might be able to help with that. The site I'm starting to build with zola has 130,000 pages of markdown, with a lot of name/id anchors, so it would be highly sensitive to any performance issues. I can run benchmarks myself with that content. Maybe if I can find a way to scrub the content (lorem-ipsumify it but keep the markdown layout) then I could share it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwcz That would be a good benchmark but only to see it it this code does not affect it at all. This code only runs if the current link checker fails. So if your current site builds (i.e. all the links work), then this code should never run. That's a useful thing to check, but doesn't give any indication of the performance of this code. I'm assuming your anchors here are either zola generated section slugs or manually specified anchors.
What you need could do is take half the markdown files, turn them into header/footer-less HTML with pandoc or equivalent. Now you should have 65,000 markdown files with pointers to 65,000 HTML files, so half of your outlinks would be to XML specified anchors and therefore use this code.
Quite how we would interpret the results I don't know. This check is surely slower than checking for markdown anchors which have to be parsed out anyway. So, it's not clear what would be an acceptable slowdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fastest implementation would be integrated into the markdown=>HTML convertor, where it's easy to collect all ids/links at the conversion stage and then to finally check at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@biodranik Who knows what would be fastest, without a good test. Collecting all the ids/links would potentially mean checking a lot more links, so it might be slower. Of course, it does mean that the link checker could check all links. At the moment HTML -> HTML links that are broken would not be picked up.
It's a question of what the link checker is there for. I think it's fine that it is limited to markdown links, so long as it does not preclude me putting in HTML anchors.
15c2b00
to
88171a4
Compare
Previously only anchors specified or generated in markdown could be linked to, without complaint from the link checker. We now use a simple heuristic check for `name` or `id` attributes. Duplicate code has been refactored and all XML anchor checks updated to use regex rather than substring match.
88171a4
to
a473333
Compare
Think this is ready to go now! |
components/utils/src/links.rs
Outdated
|
||
pub fn anchor_id_checks(anchor:&str) -> Regex { | ||
Regex::new( | ||
&format!(r#" (?i)(id|ID|name|NAME) *= *("|')*{}("|'| |>)+"#, anchor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're setting the ?i
flag, we can get rid of the multiple cases in the first part of the regex
@@ -28,6 +29,12 @@ pub fn has_anchor(headings: &[Heading], anchor: &str) -> bool { | |||
false | |||
} | |||
|
|||
|
|||
pub fn has_anchor_id(content: &str, anchor: &str) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably move this function to utils directly since we will only ever do is_match
That looks ok to me, anyone else having comments? |
@@ -9,6 +9,7 @@ include = ["src/**/*"] | |||
tera = "1" | |||
unicode-segmentation = "1.2" | |||
walkdir = "2" | |||
regex="1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space
|
||
// Case variants | ||
assert!(m(r#"<a ID="fred">"#)); | ||
assert!(m(r#"<a iD="fred">"#)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some other tags, so readers can easily understand, that the code also works with any element with id?
Looks good to me. It works well in my two trivial test case and in @mwcz's Giant Site, which had 493 broken anchors in 0.15.2 and with this patch is down to 209 (which seem to be real). And there was no noticable performance difference at all. |
* Add heuristic checking for HTML anchors Previously only anchors specified or generated in markdown could be linked to, without complaint from the link checker. We now use a simple heuristic check for `name` or `id` attributes. Duplicate code has been refactored and all XML anchor checks updated to use regex rather than substring match. * Fix regexp and refactor
* Add heuristic checking for HTML anchors Previously only anchors specified or generated in markdown could be linked to, without complaint from the link checker. We now use a simple heuristic check for `name` or `id` attributes. Duplicate code has been refactored and all XML anchor checks updated to use regex rather than substring match. * Fix regexp and refactor
* Add heuristic checking for HTML anchors Previously only anchors specified or generated in markdown could be linked to, without complaint from the link checker. We now use a simple heuristic check for `name` or `id` attributes. Duplicate code has been refactored and all XML anchor checks updated to use regex rather than substring match. * Fix regexp and refactor
As a starter for discussion, not finished yet.
Previously only anchors specified or generated in markdown could be
linked to, without complaint from the link checker. We now use a
simple heuristic check for
name
orid
attributes.Addresses #1707
IMPORTANT: Please do not create a Pull Request adding a new feature without discussing it first.
The place to discuss new features is the forum: https://zola.discourse.group/
If you want to add a new feature, please open a thread there first in the feature requests section.
Sanity check:
Code changes
(Delete or ignore this section for documentation changes)
next
branch?If the change is a new feature or adding to/changing an existing one: