Support anchor link targets #31

Open
becheran opened this issue Apr 25, 2021 · 61 comments

@becheran
Owner

becheran commented Apr 25, 2021

Anchor links

The part after the #, called the anchor link, is currently not checked. Markdown links including an anchor link target are, for example, [anchor](#go-to-anchor-on-same-page) or [external](http://external.com#headline-1).

How anchor links work

HTML defines anchor targets via the anchor tag with a name or id attribute (e.g. <a id="generator"></a>). An anchor target can also be any HTML tag with an id attribute (e.g. <div id="foo-bar"></div>).

The official Markdown spec does not define anchor targets, but most interpreters and renderers support the generation of default anchor targets in Markdown documents. For example, the GitHub Markdown flavor auto-generates link targets for all headlines (h1 to h6) with the following rules (see the sketch below the list):

  1. downcase the headline
  2. remove anything that is not a letter, number, space, or hyphen
  3. change any space to a hyphen
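
A rough Rust sketch of these three rules (illustrative only; GitHub's real slugger also de-duplicates repeated headlines with -1, -2 suffixes, which is omitted here):

fn gfm_anchor(headline: &str) -> String {
    headline
        .to_lowercase()                                             // 1. downcase
        .chars()
        .filter(|c| c.is_alphanumeric() || *c == ' ' || *c == '-') // 2. keep letters, numbers, spaces, hyphens
        .map(|c| if c == ' ' { '-' } else { c })                   // 3. spaces become hyphens
        .collect()
}

// gfm_anchor("Implementation Hints!") == "implementation-hints"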

Implementation hints

A good first step would be to add example markdown files with valid anchor links to the benches dir, which will be used for the [end-to-end unit tests](./tests/end_to_end.rs).

The library's run method is the most important one; it uses all submodules and does the actual execution.

In the link extractor module, the part after the # needs to be extracted and saved in the MarkupLink struct.

The link validator module is responsible for the actual resolving and checks whether a resource exists (either on disk or as a URL). This code needs to be enhanced so that, if an anchor link was extracted, it not only checks for existence but also parses the target file and extracts all anchor targets. The same must be done for web links. Currently a HEAD request is sent, and only if that fails is a GET sent. If an anchor link needs to be followed, a GET request is required and the resulting page needs to be parsed for all anchors.
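
A rough sketch of that flow for web links (the function and helper names here are made up for illustration, they are not existing mlc code):

async fn check_web_link(
    client: &reqwest::Client,
    url: &str,
    needs_anchor_check: bool,
) -> Result<Vec<String>, reqwest::Error> {
    if !needs_anchor_check {
        // Cheap existence check first; a failing HEAD falls through to the GET below.
        if let Ok(resp) = client.head(url).send().await {
            if resp.status().is_success() {
                return Ok(Vec::new());
            }
        }
    }
    // An anchor has to be resolved (or HEAD failed), so the body is needed.
    let body = client.get(url).send().await?.error_for_status()?.text().await?;
    Ok(extract_anchor_targets(&body))
}

// Very naive anchor-target extraction: collect everything between `id="` and the next quote.
// (Would also match e.g. data-id="..."; a real implementation should parse the HTML properly.)
fn extract_anchor_targets(html: &str) -> Vec<String> {
    html.split("id=\"")
        .skip(1)
        .filter_map(|rest| rest.split('"').next())
        .map(str::to_string)
        .collect()
}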

Besides the already existing grouping of identical links, which are checked only once as a performance boost, it would also make sense to parse a document which is the target of an anchor link only once and reuse the parse result for other references to the same doc. Also for performance reasons it would be great to only download and parse documents which actually have an anchor link pointing to them, and not all docs for all links.

@hoijui
Contributor

hoijui commented Apr 26, 2021

@hoijui if you want to contribute and help improve mlc, this would be awesome of course. Hope the tool actually does what you want it to do. I created it with my (and others') GitHub READMEs and docs in mind, to eliminate broken links there.

cool! :-)
My use case is the same; mainly for docs of open source hardware projects.

You are right... no file-system caching of downloaded files is necessary. Scanning for references once downloaded and caching that in memory makes more sense... everything you write here sounds good.

PS: The links in your issue do not work. Looks like you used relative links, and instead of going to the sources, they resolve against this URL (.../issues/). :/

Here are the (absolute) working links:

Thanks a lot! Great issue... I hope I get to look into it these days (I think I will :-) ).

@hoijui
Contributor

hoijui commented Apr 26, 2021

A question...
In the benchmark directory there are some markdown files (directly in that dir), and then there are a markdown and an html directory, with markdown and html files respectively.
Why are there markdown files at the first level? Or, say... should I just put new files under the markdown and html dirs?

@becheran
Owner Author

PS: The links in your issue do not work. Looks like you used relative links, and instead of going to the sources, they resolve against this URL (.../issues/). :/

Here are the (absolute) working links:

Thanks! Don't know what I thought when I created those links. Broken links in a link checker tool issue... 🤦‍♂️

@becheran
Owner Author

A question...
In the benchmark directory there are some markdown files (directly in that dir), and then there are a markdown and an html directory, with markdown and html files respectively.
Why are there markdown files at the first level? Or, say... should I just put new files under the markdown and html dirs?

Good question. Now that I look at those markdown files, I wonder why I did not put them in their corresponding dirs in the first place. It would be great if you added tests for anchor links in the markdown and html dirs. I will clean up here.

One more thing: there is already a warn message in https://github.com/becheran/mlc/tree/master/src/link_validator/file_system.rs line 65 where I print a message that everything after # is ignored right now. So right now, links with an anchor will not make mlc fail, but will print an error log message instead.

@hoijui
Contributor

hoijui commented Apr 27, 2021

a question ...

You have code scanning Markdown files for links (src/markdown_link_extractor.rs) and code scanning HTML files for links (src/link_extractors/html_link_extractor.rs).

Now we need code scanning for anchor targets in Markdown and HTML too.
Should this code be integrated into the link scanner code, or should it be separate?
If it gets integrated, we can make the code so one can select to only scan for links, only for anchor targets, or for both. This way, we need to scan each file only once. If we separate the code, and scan only for one of the two during each scan, it will probably be less performant, and the code will probably be bigger, overall, and more complicated; right?
So I suppose, integration is the way to go?

Another question: should we make the anchor checking optional? (I would say yes, because... why not?)

@becheran
Owner Author

So I suppose, integration is the way to go?

Wow, I just had a look at my parser code again, and I have to admit it is a bit hard to read with all the conditions...

But in the end, both extractors push a new link to the result vectors at one or several points, for example line 159 in markdown_link_extractor.rs. Instead of setting the target directly here, it might make sense to call another static function which does the splitting of anchor and target and adds this to the MarkupLink struct, since this would be the same for the HTML and Markdown link extractors, or not? Maybe add the anchor as another optional field of the MarkupLink struct, such as pub anchor: Option<String>?
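
Roughly like this (just a sketch; apart from the proposed anchor field, the struct fields here are only illustrative and do not match the real MarkupLink exactly):

pub struct MarkupLink {
    pub target: String,
    pub anchor: Option<String>, // the part after '#', if any
    pub line: usize,
    pub column: usize,
}

/// Splits "doc.md#section" into ("doc.md", Some("section")).
fn split_anchor(raw: &str) -> (String, Option<String>) {
    match raw.split_once('#') {
        Some((target, anchor)) => (target.to_string(), Some(anchor.to_string())),
        None => (raw.to_string(), None),
    }
}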

The actual anchor check and file load/download should happen somewhere in the link_validator, I would say (without having a 100% clear strategy yet).

Another question: should we make the anchor checking optional? (I would say yes, because... why not?)

Yes. Though I would prefer it to be enabled by default and disabled via a command line arg such as --no-anchors.

@becheran
Owner Author

Wow, I just had a look at my parser code again, and I have to admit it is a bit hard to read with all the conditions...

Speaking of hard to read... I am trying to get rid of the somewhat strange-looking and not so performant throttle code in lib.rs on a separate branch.

@hoijui
Contributor

hoijui commented Apr 28, 2021

Wow, I just had a look at my parser code again, and I have to admit it is a bit hard to read with all the conditions...

Speaking of hard to read... I am trying to get rid of the somewhat strange-looking and not so performant throttle code in lib.rs on a separate branch.

OK... I do not understand that, but... good luck! :-)
And... does that mean that right now I can work on the parts I need to change (mainly link_validator stuff) without causing many merge conflicts for you (or me) later?

@becheran
Owner Author

And... does that mean that right now I can work on the parts I need to change (mainly link_validator stuff) without causing many merge conflicts for you (or me) later?

You can work on that part right now. My changes should be far away in lib.rs, and if we need to merge I will do that. No worries.

@hoijui
Contributor

hoijui commented May 12, 2021

For example, the GitHub Markdown flavor auto-generates link targets for all headlines (h1 to h6) with the following rules:

  1. downcase the headline
  2. remove anything that is not a letter, number, space, or hyphen
  3. change any space to a hyphen

As you write, this is the case for GFM, but I just found out that Pandoc Markdown does it differently. :/
It strips any numbers at the beginning of the title.
This means that for this anchor checking we have to allow the user to choose which Markdown flavor they are using, and we have to support the different algorithms of these flavors, I guess.

@becheran
Owner Author

That's the problem of not having a Markdown standard. Every implementation can do it differently :-(

I guess there is no way around having a command line option. But what would be your most relevant use case? Personally I use mlc mainly for GitHub README checks, so I would love to have GitHub style support.

Just re-visited issue #19 where someone states:

Yeah, this would be very nice. I use those with GitLab docs extensively. Sometimes, I even use custom anchors:

### This is a very long header {#long-header}

Let's [link to it](#long-header)

Never saw this {#long-header} syntax for headline anchors before. So this might be a third style only valid for GitLab markdown...

@dvtomas

dvtomas commented May 12, 2021

@becheran It was me writing about the {#id} syntax. I use it when generating documents from Markdown using pandoc; the relevant docs are here, look for Heading identifiers. I might have been wrong about GitLab supporting it, sorry about that. The Pandoc docs say the same syntax works with PHP Markdown Extra, see here. So it may be worth considering configuring support for this syntax not on a Markdown variant basis, but just with a switch enabling this extended support (imagine something like --enable-special-attributes). Just an idea, though.

@hoijui
Contributor

hoijui commented May 12, 2021

My main use cases are GitHub, GitLab and Pandoc too. I also use the {#id} syntax (with pandoc). It is quite useful for when you have documents of this structure:

## Foo

### Intro {#foo-intro}

## Bar

### Intro {#bar-intro}

Without that syntax, the anchors would just be enumerated (#intro and #intro-1, for example), which of course is cryptic and subject to change when the document structure changes.

@hoijui
Contributor

hoijui commented May 31, 2021

.. I am not dead!
still on it, although sporadically. Today I did some stuff again and, for the first time, felt somewhat comfortable using Rust... yay!
Somehow I feel it is partly because of your code. It is not "sophisticated" in a bad way... thanks! :-)
Still no commits to see yet, but I am mostly done with the benchmark files (at least in Markdown), and I am about 30% through a first, basic implementation in the code.

@becheran
Owner Author

@hoijui take all the time you want. There are no deadlines. We do this for fun. At least this is how I treat coding in my spare time. It should be at least as enjoyable as watching a Netflix series 😛

If you need/want any help, plz let me know. Rust rocks! 🦀

@hoijui
Contributor

hoijui commented Jun 1, 2021

:-)
thank you @becheran ! It feels good to hear that.

And yes... Rust can be fun, I have learned now, cool! :-)
I am a bit unsure how to go about the parser part. It needs to be extended with parsing for the anchors (those auto-generated from titles, and the Pandoc-Markdown-style, manually set ones). Reading the parser part, it looks a bit... very verbose... is this really the best way to do it, comparing the upcoming chars like this:

            } else if line_chars.get(column) == Some(&'<')
                && line_chars.get(column + 1) == Some(&'/')
                && line_chars.get(column + 2) == Some(&'s')
                && line_chars.get(column + 3) == Some(&'c')
                && line_chars.get(column + 4) == Some(&'r')
                && line_chars.get(column + 5) == Some(&'i')
                && line_chars.get(column + 6) == Some(&'p')
                && line_chars.get(column + 7) == Some(&'t')
                && line_chars.get(column + 8) == Some(&'>')
            {

Obviously I am way too much of a newbie to decide; it just looks too verbose to me.
... maybe a helper function that takes line_chars and "</script>", or another Vec<char> containing "</script>"?
Or can that not be made as performant?
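
For example, a helper along these lines (just a sketch, not tested against the real parser):

// Check whether `line_chars` contains the literal `pattern` starting at `column`.
fn starts_with_at(line_chars: &[char], column: usize, pattern: &str) -> bool {
    pattern
        .chars()
        .enumerate()
        .all(|(i, c)| line_chars.get(column + i) == Some(&c))
}

// The long condition above would then shrink to:
// } else if starts_with_at(&line_chars, column, "</script>") {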

That is a minor thing though; I am not much concerned about it, but I am about this:
I do not feel confident that I could manually edit this parsing code and not produce lots of bugs (wrongly parsed edge cases). How did you come up with this code?
For example: the markdown parsing code seems to skip ATX-style titles. So links in titles are not recognized by mlc, but CommonMark supports them, and everything else I tried so far did too.
That then makes me ask... should we use an external parser, a library, for that?
It seems that there are not many Rust markdown parsers, and they seem to support only CommonMark and GFM, but I would also need Pandoc Markdown (for the manually set anchors). There is also a Rust library interfacing with pandoc (the binary) to produce pandoc ASTs, which can then be used in Rust. Pandoc supports GFM, CommonMark, Pandoc Markdown, HTML, reStructuredText, and MANY more input formats!
It just has two major downsides:

  • it is not rust, but we would be calling an external binary (written in Haskell)
  • Pandoc does not give source line and column of the parsed parts

The second issue there makes it practically unusable. :/
... except... we venture into pandoc development, majorly!
Coincidentally, I know some of the pandoc internal issues which would make this a huge task: it is made up of many parts (obvious, when looking at all the languages it supports) that all interconnect to each other through its AST. Pandoc supports no versioning or other modularity metadata for an AST as of now, which means... every change that would affect the AST needs EVERYTHING to be modified. This includes all the parsing and serializing modules and, even much worse, all the filters written by everyone who uses pandoc.
And this is by far the best document conversion tool out there (in terms of spread and functionality). There is basically no alternative to it.

... Yet another approach would be to use a general parsing library (lexer, PEG, LR, LL, ... I don't know what all that means ;-) ).

I did not check for HTML, but it being much more standardized, more widespread and so on, I bet there would be one or two good Rust HTML parsers that could be used, though even editing your code for that looks easy enough for me to do it.

But yeah... I am a bit frustrated with the Markdown issue. Of course, in the end it all comes down to markdown not being one standard (plus pandoc's internal issues).

... breathing again

@hoijui
Contributor

hoijui commented Jun 1, 2021

The links-in-title thing is something I came across purely by accident, and I know I would have missed much more if I had written the parser; I just mentioned it because... inelegantly put, it made me feel my fear of this being complex is valid.

@hoijui
Contributor

hoijui commented Jun 3, 2021

I had some bad days. am better now... ignore the above! ;-)

@becheran
Owner Author

becheran commented Jun 3, 2021

@hoijui sorry for late reply.

You are right. The parser part is nothing to be proud of. It was kind of the first Rust code that I wrote and it looks pretty ugly. I am sorry. My thoughts when I started this were:

  • I need to extract links from markdown with keeping the line and column
  • There is no lib out there which does the job
  • I came up with this if-cascade/match-case parser monster

Another issue with this code is that I did not know the very well documented CommonMark and GitHub Flavored Markdown specs. I should have read them before. Anyway... too late, and it does work, except for the edge cases (a.k.a. bugs) where it doesn't :-P.

@becheran
Owner Author

becheran commented Jun 3, 2021

The links-in-title thing is something I came across purely by accident, and I know I would have missed much more if I had written the parser; I just mentioned it because... inelegantly put, it made me feel my fear of this being complex is valid.

Is it a bug? I wonder what I thought when I wrote the link_in_headline unit test?? Was it not possible to have links in headlines on GitHub before? Did they change the markdown parser? I can't reconstruct this anymore. The fact is that one can have links in headlines even on GitHub. See issue #35.

@becheran
Owner Author

becheran commented Jun 3, 2021

OK. And for the future of this issue...

Using pandoc is unfortunately not an option due to the issue that you pointed out. Without the source file metadata, the parser is not really practical here. :-(

So I assume there is no real way around writing our own parser? I just did a quick search and found this lib, which looks as if it does a good job of extracting tokens. I still believe that extracting links and headlines from markdown flavors should be a solvable problem.

What do you think, @hoijui? Do you think it makes sense if I re-write the parser using the logos lexer?

@dvtomas

dvtomas commented Jun 4, 2021

Hi all,

I recently wrote some very simple, quick-and-dirty intra-document link checking for my markdown documents. I'm not sure if it's of any utility for you, but just maybe.

Here's the code. The Itertools dependency is there just for convenience methods (collect_vec, join) and can easily be removed.

use std::collections::HashSet;
use itertools::Itertools;

/// Checks that each `[Foo](#foo)` link has a corresponding `# Foo header {#foo}` header
fn check_markdown_anchor_links(document: &str) -> Result<(), String> {
    fn words<'a>(s: &'a str, starting_with: &str, ending_with: char) -> HashSet<&'a str> {
        s.split(starting_with).skip(1).filter_map(|s| s.split_once(ending_with).map(|s| s.0)).collect()
    }

    let anchors = words(document, "{#", '}');
    let links = words(document, "(#", ')');
    let links_without_anchors = links.difference(&anchors).map(|link| *link).collect_vec();
    if links_without_anchors.is_empty() {
        Ok(())
    } else {
        Err(format!("The following [](#) intra-document links don't have a corresponding {{#}} header anchor: {}", links_without_anchors.join(",")))
    }
}

/// Suppose that the file contains [Internal link to headers][] as per https://pandoc.org/MANUAL.html#extension-implicit_header_references.
/// Check that all such internal links refer to existing headers. See the test.
fn check_markdown_implicit_header_references(document: &str) -> Result<(), String> {
    let anchors: HashSet<&str> = {
        document.lines().filter_map(|l|
            if l.starts_with('#') {
                Some(l.trim_start_matches('#').trim())
            } else {
                None
            }
        ).collect()
    };

    let links: HashSet<&str> = {
        let ending = "][]";
        document
            .split_inclusive(ending)
            .filter_map(|s| if s.ends_with(ending) {
                s.trim_end_matches(ending).rsplit_once('[').map(|s| s.1)
            } else {
                None
            })
            .collect()
    };

    let links_without_anchors = links.difference(&anchors).map(|link| *link).collect_vec();
    if links_without_anchors.is_empty() {
        Ok(())
    } else {
        Err(format!("The following [...][] intra-document links don't have a corresponding # header: {}", links_without_anchors.join(",")))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_check_markdown_implicit_header_references() {
        let document = "\
# Header
This is a link to [Header][]
This is [Ignored, no trailing empty square brackets]
This is a [link to a non-existing anchor][]
Another [Ignored, this time at the end of the document]
        ";
        assert_eq!(
            &check_markdown_implicit_header_references(document).unwrap_err(),
            "The following [...][] intra-document links don't have a corresponding # header: link to a non-existing anchor"
        )
    }

    #[test]
    fn test_check_markdown_anchor_links() {
        let document = "\
# Header {#header}
This is a [link to the header](#header)
This is a [link to a non-existing anchor](#non-existing)
This is [link to a non-existing anchor](#non-existing)
        ";
        assert_eq!(
            &check_markdown_anchor_links(document).unwrap_err(),
            "The following [](#) intra-document links don't have a corresponding {#} header anchor: non-existing"
        )
    }
}

@hoijui
Contributor

hoijui commented Jun 4, 2021

:D :D
wow... thank you @dvtomas... it's like magic that you ended up here with this right now! :-)
If I read the code correctly, you are not doing a structural parsing of the Markdown, but rather... scanning for certain strings.
So for example, you would recognize links and titles even when they are within code blocks, right?

@hoijui
Contributor

hoijui commented Jun 4, 2021

OK. And for the future of this issue...

Using pandoc is unfortunately not an option due to the issue that you pointed out. Without the source file metadata, the parser is not really practical here. :-(

So I assume there is no real way around writing our own parser? I just did a quick search and found this lib, which looks as if it does a good job of extracting tokens. I still believe that extracting links and headlines from markdown flavors should be a solvable problem.

What do you think, @hoijui? Do you think it makes sense if I re-write the parser using the logos lexer?

Sounds good!
Do you have the time and energy for that? :-)
I have no practical experience with anything regarding parsing/lexers and such... but if you think this one is good... good!

I guess I'll try to wrap up what I have so far into commits, even though it is of course not functioning without the parsing part, and... later retrofit it to your changes...?

@dvtomas

dvtomas commented Jun 4, 2021

@hoijui No magic, I've been watching the issue because I'm also interested in that :-)

You're right about the code just looking for opening and closing sequences regardless of context. Adding support for code blocks shouldn't be too hard, though: I can imagine parsing the document line by line in a similar fashion to what my code does, and adding a simple state indicator inside_block that is toggled on and off whenever a triple backtick is encountered (and probably ignoring a line if it starts with a four-space indent, which is another way of expressing a code block). I'm not saying that it's the best solution, but it is a simple one. Tests on real-world documents would probably reveal whether it is viable.
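
A sketch of what I mean (fenced code blocks only, ignoring indented blocks and nesting):

// Drop everything inside ``` fenced code blocks before scanning for links/anchors.
fn strip_fenced_code_blocks(document: &str) -> String {
    let mut inside_block = false;
    document
        .lines()
        .filter(|line| {
            if line.trim_start().starts_with("```") {
                inside_block = !inside_block; // toggle on every fence line
                return false;                 // drop the fence line itself
            }
            !inside_block
        })
        .map(|line| format!("{}\n", line))
        .collect()
}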

@hoijui
Contributor

hoijui commented Jun 4, 2021

@dvtomas lists also use indenting, and code within lists and lists in lists may use more indents...
If you could get away with this approach, nobody would use the more complex mechanisms like a lexer. :/
Though if your solution works for you in your case, it is good of course!

My use case is pretty much the opposite:
I need the tool to be very robust, as I want it to be used in CI jobs for all Open Source Hardware documentation projects (which are available as a git repo). Optimally, thousands. They will come up with every possible way to challenge this tool, I bet. :-)

@dvtomas

dvtomas commented Jun 4, 2021

Sure, I'm all for a robust solution and looking forward to it. It's just that you could devise a simple solution in like a day and have an already useful utility covering an overwhelming majority of use cases. If some broken link inside a deeply nested list isn't reported, that's of course bad, but not a tragedy. You could work towards the perfect solution meanwhile.

Btw, have you considered adapting (perhaps improving) some existing parser, e.g. https://crates.io/crates/markdown ? Sorry if this was already discussed, I haven't checked the whole discussion.

@hoijui
Contributor

hoijui commented Jun 4, 2021

... ahhh, good idea!
https://github.com/johannhof/markdown.rs/blob/master/src/parser/mod.rs#L78
This method takes a string (MD document) and returns a tree of parsed elements. That should work; and yes, in case it does not support custom header anchors, we could add it.
Looking at the project issues though, it looks to be quite buggy still, and it does not mention following any markdown flavor, which means... it would be quite some work too.
Is there another library? It looks like not, right?

@becheran
Owner Author

becheran commented Jun 4, 2021

ahh.. what about this?

https://github.com/kivikakk/comrak

This looks very promising. It seems to even include metadata linking AST nodes to the original document. This is awesome. It states that it is GitHub and CommonMark compatible. For Pandoc Markdown I guess we can get this working as well. I might have an hour or two tonight to check this out in detail. Thanks!

@dvtomas

dvtomas commented Jun 4, 2021

Nice find. The comrak README mentions https://github.com/raphlinus/pulldown-cmark, which might also be worth looking at.

@becheran
Owner Author

becheran commented Jun 4, 2021

So I just tried the comrak lib and committed on a branch. It does the job of extracting the links.

One thing that puzzles me is the start_line metadata info of parsed Nodes, which seems to be zero for all inline elements. We would need to save the line info from the last container node, which should be doable.

Another issue is that the column info is missing completely. I will ask the comrak authors if this info is easy to add and whether they would like to add this needed info to their lib.

@becheran
Owner Author

becheran commented Jun 4, 2021

@dvtomas I also tried pulldown_cmark, and for me this looks even more promising. I do like the lib interface a bit more. More importantly, it is possible to retrieve both the line and column info from the parser result. I created another branch which already contains the replacement.

Two things are still missing though: reference links seem to be handled differently, and inline HTML links also need to be parsed separately after they are detected.
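
For reference, the extraction via into_offset_iter looks roughly like this (sketch; the byte offsets are converted to line/column by hand, and the exact shape of Tag::Link may differ between pulldown-cmark versions):

use pulldown_cmark::{Event, Parser, Tag};

fn print_links_with_position(document: &str) {
    for (event, range) in Parser::new(document).into_offset_iter() {
        if let Event::Start(Tag::Link(_link_type, dest, _title)) = event {
            // Translate the byte offset of the event into a 1-based line/column pair.
            let before = &document[..range.start];
            let line = before.matches('\n').count() + 1;
            let column = range.start - before.rfind('\n').map_or(0, |i| i + 1) + 1;
            println!("{}:{} -> {}", line, column, &*dest);
        }
    }
}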

@hoijui
Contributor

hoijui commented Jun 5, 2021

Wooow! :O
great @becheran!
The cool thing with your code is that you can keep multiple parsers around and switch among them by calling one or the other (changing just one line). So... sounds like for now, pulldown_cmark makes more sense. I imagine what is missing then is:

  1. collecting/generating Markdown auto-anchors from headers (As there is GFM support, that might already work/be provided)
  2. collecting manual Markdown anchors (Pandoc style # My Title {#the anchor}) (This is available through a not yet merged pull-request: Support heading attribute block (especially ID and classes) pulldown-cmark/pulldown-cmark#522)
  3. extracting HTML links and anchors (and calculate the correct line and column for them, if possible)
  4. (?) choose/use an HTML parsing library, and use it for HTML files and probably inline HTML inside Markdown

Would you like to work further on any of this?
I will probably not do much until Monday. Surely nothing with your code until you say you will leave it to me.

@hoijui
Contributor

hoijui commented Jun 5, 2021

regarding Raw-HTML ...

With Pandoc one has to do the same as with pulldown-cmark - parse inline HTML separately. One issue I found there, which we might also see here, is that a normal HTML link within markdown:

Some text <a href="some_url">link text</a> more text.

gets parsed into an AST roughly like this:

MarkdownText("Some text")
RawHTML("<a href="some_url">")
MarkdownText("link text")
RawHTML("</a>")
MarkdownText("more text.")

So you can't just parse individual RawHTML tokens with the HTML parser and look for complete link definitions like <a href="some_url">link text</a>. Though if one looks only for the start tags <a href="some_url">, it would work. So if using an external HTML parser, it is something to check whether it supports scanning for that.

Or maybe we can use a hack, like extract_links(html_parse(raw_html_from_md + "</a>")), to ensure a potentially open a tag is closed.
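
For the start-tag-only variant, even something as simple as this sketch would do (hypothetical helper, not mlc code):

// Pull the href out of a RawHTML fragment that may contain only the opening tag,
// e.g. `<a href="some_url">` without the closing </a> in the same token.
fn href_from_raw_html(fragment: &str) -> Option<&str> {
    let after = fragment.split("href=\"").nth(1)?;
    after.split('"').next()
}

// href_from_raw_html(r#"<a href="some_url">"#) == Some("some_url")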

@becheran
Owner Author

becheran commented Jun 5, 2021

@hoijui
I finished the comark integration and merged the changes into master. It seems to work pretty well, and the existing unit tests helped to check whether the behavior is as expected.

About the open points:

  1. collecting/generating Markdown auto-anchors from headers (As there is GFM support, that might already work/be provided)

Still todo.

  2. collecting manual Markdown anchors (Pandoc style # My Title {#the anchor}) (This is available through a not yet merged pull-request: raphlinus/pulldown-cmark#522)

We could wait, or parse a potential text part in the headline to get this? Should be doable even without the pulldown-cmark pull-request being merged. Still TODO.

  3. extracting HTML links and anchors (and calculate the correct line and column for them, if possible)

I already did that. I re-used my own link extractor code. Replacing this can also be done if we need it.

  4. (?) choose/use an HTML parsing library, and use it for HTML files and probably inline HTML inside Markdown

Maybe not even required? Do you see issues with the HTML parser? A search for Rust HTML parsers did not reveal any which keep track of metadata (line/col) in the parsed structure... Or I did not search long enough...

@becheran
Owner Author

becheran commented Jun 5, 2021

With Pandoc one has to do the same as with pulldown-cmark - parse inline HTML separately. One issue I found there, which we might also see here, is that a normal HTML link within markdown:

I added your code example as a unit test, and my current solution with pulldown_cmark and my own HTML parser resolves this as expected. See this commit.

@hoijui
Contributor

hoijui commented Jun 6, 2021

NIICE! :D
I'll rebase on master then, and continue there... the solution you chose seems optimal! :-)

It seems to be possible to have a cargo dependency on a specific repo and commit, so I think I will do that for my branch.
If that PR does not get merged, we might have to maintain our own fork including that PR.
I would prefer that over coming up with and maintaining our own solution for anchors.

pulldown-cmark = {git = "https://github.com/lo48576/pulldown-cmark", branch="feature/heading-attrs"}

@hoijui
Contributor

hoijui commented Jun 6, 2021

Alternatively to that PR, we could also use this filter:
https://github.com/ShotaroTsuji/markdown-heading-id

I will stick with the PR for now, as I have that set up already.

@hoijui
Contributor

hoijui commented Jun 9, 2021

I am still on it!
It is still advancing. :-)

@hoijui
Contributor

hoijui commented Jun 15, 2021

I am kind of stuck with... Rust-related issue(s). ;-)
(as in: my lack of Rust skills)

I created the AnchorsCache type, and in this function I want to add to the cache.
Really, I don't know how to do it, and the way I try does not work, because I do not get a mutable instance of the cache (or say, the Vec<MarkupAnchorTarget>). Plus, the .unwrap().as_ref().unwrap().as_ref().unwrap() in that specific line where I have the issue... looks so bad that I must be miles off.

type AnchorsCache = HashMap<reqwest::Url, Option<reqwest::Result<Vec<MarkupAnchorTarget>>>>;

pub async fn check_http(target: &str, anchors: Option<Vec<MarkupAnchorTarget>>, remote_anchors: &mut AnchorsCache) -> LinkCheckResult {

    debug!("Check http link target {:?}", target);
    let url = reqwest::Url::parse(&target).expect("URL of unknown type");

    match http_request(&url, anchors.is_some()).await {
        Ok((response, parsed_anchors)) => {
            if parsed_anchors.is_some() {
                remote_anchors.get(&url).unwrap().as_ref().unwrap().as_ref().unwrap().append(&mut parsed_anchors.unwrap());
            }
            response
        },
        Err(error_msg) => LinkCheckResult::Failed(format!("Http(s) request failed. {}", error_msg)),
    }
}

Do I have to put &mut somewhere? Do I have to use a named lifetime parameter ('a) somewhere, somehow?
... please help!

Also, I realized that surely you will have to refactor a lot after I am done. I have no problem with that; I see this as a learning experience. If there were not actually something useful coming out of this, I would probably give up on it. ;-)

@becheran
Owner Author

@hoijui try get_mut on the map instead of get
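
Spelled out against your current AnchorsCache type (just a sketch, reusing your variable names), that part could look like:

if let Some(mut parsed) = parsed_anchors {
    // get_mut hands out a mutable borrow of the cached value, so append works;
    // the nested pattern also avoids the whole unwrap() chain.
    if let Some(Some(Ok(anchors))) = remote_anchors.get_mut(&url) {
        anchors.append(&mut parsed);
    }
}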

@hoijui
Contributor

hoijui commented Jun 16, 2021

oh, thanks! :-)
.. that is certainly required, but did not yet do the trick (leaving the rest as is), will also try more...

@dvtomas

dvtomas commented Jun 16, 2021

I think one of the problematic things is the type signature of type AnchorsCache = HashMap<reqwest::Url, Option<reqwest::Result<Vec<MarkupAnchorTarget>>>>;

  1. Why is there an Option<Result<...>> ? The Option seems redundant to me in a hashmap, where the absence of value is usually expressed just by not having the appropriate key at all. What is the meaning of a None value in your case?
  2. Why is there a Result inside the Option? Do you really want to cache errors as well? It may be so, but make sure it is the desired behavior.
  3. Another strange thing is that you are not actually using the AnchorsCache to test if the value is already cached. Is that intentional?
  4. The very strange thing is that you are appending the fetched anchors from an URL to those already in cache. Does it make any sense? I would say that there is a fixed set of anchor targets for each url (the hashmap key), and the correct way of resolving them is just replacing them when a newer set of targets has been retrieved. Is that not your case? My guess is that this sole modification would get rid of all the unwraps in there.
  5. Don't use unwrap()s. If there really is a good reason to retrieve a value which may panic (either you are OK with panic, or you are absolutely sure that the panic can't occur), use .expect("This is the reason why I can use a panicking get here") instead of .unwrap().

I may be very wrong on these points, but they are things that seem odd to me. Maybe there's a legitimate reason for all of them that I don't see from a distant glance.

@hoijui
Contributor

hoijui commented Jun 16, 2021

ahh thanks @dvtomas ! :-)

I think one of the problematic things is the type signature of type AnchorsCache = HashMap<reqwest::Url, Option<reqwest::Result<Vec<MarkupAnchorTarget>>>>;

1. Why is there an Option<Result<...>> ? The Option seems redundant to me in a hashmap, where the absence of value is usually expressed just by not having the appropriate key at all. What is the meaning of a None value in your case?

Can't think of anything right now... I think you are right. ;-)

2. Why is there a Result inside the Option? Do you really want to cache errors as well? It may be so, but make sure it is the desired behavior.

Yes, I want to cache errors: say, a URL is unavailable; I want to store that and not retry fetching it all the time.

3. Another strange thing is that you are not actually using the AnchorsCache to test if the value is already cached. Is that intentional?

I think I just wanted to get... something to compile, and then would have added this later.
(... just trying to recreate my thought process of a few days ago. ;-) )

4. The very strange thing is that you are _appending_ the fetched anchors from an URL to those already in cache. Does it make any sense? I would say that there is a fixed set of anchor targets for each url (the hashmap key), and the correct way of resolving them is just _replacing_ them when a newer set of targets has been retrieved. Is that not your case? My guess is that this sole modification would get rid of all the unwraps in there.

Indeed... ;-)

5. Don't use unwrap()s. If there really is a good reason to retrieve a value which may panic (either you are OK with panic, or you are absolutely sure that the panic can't occur), use `.expect("This is the reason why I can use a panicking get here")` instead of `.unwrap()`.

got it, thanks!

I may be very wrong on these points, but they are something that seems funny to me. Maybe there's a legitimate reason for all of them I don't see from a distant glance.

All good points you made... thank you! :-)

@hoijui
Contributor

hoijui commented Jun 16, 2021

Slightly changed AnchorsCache, and added documentation:

/// If a URL is not stored in the map (the URL does not appear as a key),
/// it means that URL has not yet been checked.
/// If the Result is Err, it means the URL has been checked,
/// but was not available, or anchor parsing has failed.
/// If the Option is None, it means the URL was checked and evaluated as available,
/// but no parsing of anchors was tried.
/// If the Vec is empty, it means that the document was parsed, but no anchors were found.
type AnchorsCache = HashMap<reqwest::Url, reqwest::Result<Option<Vec<MarkupAnchorTarget>>>>;

@hoijui
Contributor

hoijui commented Jun 16, 2021

... OK, getting the grasp of it now... thanks! :-)
I feel like I can go on now... that was indeed a very bad try there.

@dvtomas

dvtomas commented Jun 16, 2021

Most of it makes sense. Maybe you'll want to have a dedicated error like

enum Error {
  RetrieveError(reqwest::Error),
  ParseError(...)
}

but you'll see whether that's really necessary.

What really looks odd to me is the "If the Option is None, it means the URL was checked and evaluated as available, but no parsing of anchors was tried."
How can such a situation happen? And more importantly, why would you want to cache such a state?

Anyway, perhaps even replacing the whole compound value type with something like

enum CacheEntry {
  Parsed {
     errors: Vec<String>,
     anchors: Vec<MarkupAnchorTarget>,
  },
  RetrieveError(reqwest::Error),
  RetrievedButNotParsed,
}

would be a more fitting and expressive description of the relevant states? It even allows remembering errors during parsing, while also storing at least some of the anchors if the parser is lenient enough. There are no Results of Options or whatever; everything is flat and readable. What do you think?
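
For illustration, a lookup against such a cache then reads quite explicitly too (sketch, assuming the enum above):

fn describe(entry: Option<&CacheEntry>) -> &'static str {
    match entry {
        None => "URL not checked yet",
        Some(CacheEntry::RetrieveError(_)) => "URL could not be retrieved",
        Some(CacheEntry::RetrievedButNotParsed) => "URL is alive, anchors unknown",
        Some(CacheEntry::Parsed { .. }) => "anchors available for checking",
    }
}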

@becheran
Owner Author

I do like the proposal of @dvtomas of using an enum return value. I should use enums much more frequently in Rust. They are so much more powerful compared to other languages. Using enums makes the state clear even without documenting it.

I am also not 100% sure why you would need the "but no parsing of anchors was tried" state, @hoijui. When does this happen? When would you directly parse it and when not?

@hoijui
Contributor

hoijui commented Jun 16, 2021

I do like the proposal of @dvtomas of using an enum return value. I should use enums much more frequently in Rust. They are so much more powerful compared to other languages. Using enums makes the state clear even without documenting it.

ok, will do too! :-)

I am also not 100% sure why you would need the "but no parsing of anchors was tried" state, @hoijui. When does this happen? When would you directly parse it and when not?

If the cache is going to be serialized and reloaded, for example.
In one run you might not link to fileX.md with any anchors; in the next run you will. So in the first run you would only check the availability of the URL, while in the second run you would have to download the whole file and parse it. That is why I make this distinction.

@becheran
Owner Author

If the cache is going to be serialized and reloaded, for example.

Ah, got it. I did not have this use case in mind. Though I wonder: for how long will those caches be valid? How can you tell in the next run that the file did not change?

@hoijui
Contributor

hoijui commented Jun 18, 2021

I cannot tell that, but the user should have the choice, through an optional flag.
Let's say I am building a lot of documentation directories, or I am building 100 projects, each on average 100 times a day, or I know I'll be working on documentation offline for a week but want to be able to test its quality as I go; or maybe at some point we will support scanning multiple projects in one execution of mlc, or a list of individual files (without interconnection), instead of a dir containing a documentation.

@becheran
Owner Author

@hoijui are you still working on this issue?

@hoijui
Contributor

hoijui commented Sep 13, 2021

hey!
uhhh, sorry for the late answer! :/
I was not working on this, because... frankly, it got really ugly (code-design wise).
There is good news too, though:
Most of the last few weeks I was writing other Rust software, and since Friday I feel like... I got the basic hang of it. I can now write code without relying purely on the compiler telling me what to do. ;-)
And the other good news is: I am now employed... paid, to write all this. This extension to mlc is part of it, so sooner or later it will happen. I might rewrite much of what I have written so far, as my understanding of how to do things properly is much better now (maybe not yet good, but... usable).
I also just pushed all I did so far, just in case something out of the ordinary happens and I never come back here, though this is all work from before my recent skill improvements, and it might not be worth building upon, but... it's there.
... and please feel free to ask again anytime... it might make me take it up then. But probably not for another month from now.

@becheran
Owner Author

@hoijui hey, nice to hear. Congrats on the new job! Sounds great!
All right. Just contact me if you want input or need help.

@hoijui
Contributor

hoijui commented Sep 23, 2021

NOTE (mostly) to self:
Should use the url crate for parsing:
https://docs.rs/url/2.2.2/url/
It even supports parsing relative URLs, meaning we can use it to parse local links too. That should reduce complexity quite a bit (and improve stability for edge cases). Under the link above, relative URL parsing is shown as well; it requires supplying a base, which could be "file:///path/to/project/root".
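
Roughly (sketch, using the base idea from above):

use url::Url;

fn resolve(link: &str) -> Result<Url, url::ParseError> {
    // The base needs a trailing slash so that joins are relative to the root dir.
    let base = Url::parse("file:///path/to/project/root/")?;
    let resolved = base.join(link)?;
    // resolved.fragment() now holds the part after '#', e.g. Some("some-anchor")
    Ok(resolved)
}

// resolve("docs/README.md#usage") -> "file:///path/to/project/root/docs/README.md#usage"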

Second NOTE:
I should call this thing "fragment", not "anchor", as that seems to be the more official/formal/correct term.

@hoijui
Contributor

hoijui commented Nov 16, 2021

..looks like I am working on this again now...
I hate to look at the code I wrote so far, so I will probably start fresh, now with better Rust knowledge, copying code and ideas from my last try when appropriate.

... oh NOTEs, great!

@hoijui
Contributor

hoijui commented Mar 24, 2023

I just rebased my stuff on top of your changes until now; was some work. :-)

One question: why do you squash commits, e.g. from pull-requests or your own feature branches?
I can understand it if it is just a bunch of messy commits, not nicely split up, but if it is halfway OK split up, I feel it is a big loss to do that. There are ways to show the history as if merges were squashed, if it is just that you do not want to see it.
I personally do not consider myself a writer of code, but a writer of the history of code. It is my primary (more or less) creative product. Destroying the history... pains me.
"... the children, the children! think of the children!" ;-)
I personally do not submit messy commits as a pull-request, and I have not seen you work messy in any of your feature branches either.

(with messy I mean stuff like: "Try to fix 1", "Try to fix again", "Fix typo in try to fix 1", ... )

@becheran
Owner Author

Hey, great to hear. 😌 I don't have too many feelings about my history. For feature and bugfix branches I simply squash most of the time, because I think a single commit in the history is enough to explain what I did. But I am also fine with not squashing yours.
