
Check Documentation Links #2042

Open
zuphilip opened this issue Apr 28, 2016 · 19 comments


@zuphilip
Member

A discussion about how to check documentation links started on twitter: https://twitter.com/adam42smith/status/725749988702179329 CC @adam3smith @rmzelle @inukshuk

I am quite sure there are a lot of different ideas and solutions, but maybe we should first do some requirements engineering. What do we actually want?

  • When do we want to check documentation links?
  • Which links do we want to check?
  • Which header codes (HTTP status codes) do we want to check (or maybe just avoid 404s)?
  • What should happen if a check fails?
@adam3smith
Member

When to check?

  • periodically, but I don't think via CI. I'd imagine the main problem here is linkrot, not newly added documentation links that 404. That's why I was thinking of a separately run script.

Which links?

  • all of them (same reason as above)

Which header code?

  • I liked the idea of checking for both 404s and redirects (and keeping them in separate categories); that should get us quite far.

What to do with failures?

  • I just want a list we can work through

@rmzelle
Member

rmzelle commented Apr 28, 2016

We also don't want 404s to result in failing Travis builds, since such failures might be temporary (servers that are down) or affect styles other than the ones changed in PRs. That would easily end up being very confusing for contributors.

@adam3smith
Member

Right -- in general we'd want CI errors and warnings to apply only to the current PR, which is why I was thinking a separate script might be better.

@fbennett
Member

Another possibility would be to use perma.cc:

https://perma.cc/

@adam3smith
Member

adam3smith commented Apr 28, 2016

since in this case changing links may mean changing styles, I don't think perma.cc is what we want (or am I misremembering what that does?)

@rmzelle
Member

rmzelle commented Apr 28, 2016

Yeah, permalinks don't seem very useful. They hide the destination in the style, and we'd have linkrot of permalink targets instead of our own links.

@zuphilip
Member Author

If a documentation link fails, then something else might also have happened, e.g. the publisher updated their style requirements. Thus, I guess we really want to capture these cases and then do something about them. But what do we want to do then? If we check the documentation links of the whole repo, we might end up with hundreds of failed links. Can we update them all manually? Just deleting them doesn't seem helpful either...

@adam3smith
Member

Yes, I think we want a list of all failures and to go through them gradually, starting with 404s. I don't see an alternative. After the first pass, if we do this twice a year, it should be pretty quick.

@fbennett
Member

A perma.cc page archives the page as written, a snapshot, and the original URL. It's an archival tool, so the saved page content doesn't reflect subsequent changes, but it does protect against complete loss of the style guide against which a style was prepared. It could be combined with a link-check script to detect gone-dead links.

@zuphilip
Member Author

Here is a first hack at a script (actually a one-liner) which you can run in bash:

$ grep -Poh '[^"]*(?=" rel="documentation")' *.csl | xargs curl -ILv

The output (only the first 100 URLs) is not pretty yet, and you have to extract the information from it somehow. You can search, for example, for "404 Not Found". Is this what we want?
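
For illustration, here is a slightly more structured sketch that prints one status code per URL (assuming GNU grep for -P and curl; the output format is just a suggestion):

#!/bin/bash
# Sketch: print "<status> <url>" for every documentation link in the styles.
# Run from the repository root; GET instead of HEAD, body discarded, no -L so
# permanent redirects (301) are reported as such.
grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u | while read -r url; do
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 15 "$url")
  echo "$status $url"
done | sort

The result could then simply be grepped for 404, 301, and so on.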

@fbennett
Member

If the script were to extract links and set them as anchors in an (ephemeral) local index page, you could run LinkChecker over the page to get a nicely formatted report on redirects and bad links.

https://wummel.github.io/linkchecker/
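
Something along these lines might do it (just a sketch, assuming GNU grep and the linkchecker command-line tool are installed; the temporary file name is arbitrary):

#!/bin/bash
# Sketch: build an ephemeral index page of all documentation links and run
# LinkChecker over it to get a formatted report on redirects and bad links.
{
  echo '<html><body>'
  grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u \
    | sed 's|.*|<a href="&">&</a><br/>|'
  echo '</body></html>'
} > /tmp/doc-links.html
linkchecker --check-extern /tmp/doc-links.html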

@fbennett
Member

On second thought, linkchecker might not be at all good for this. I cobbled some code together and ran a full report against the independent styles. I'll attach the output in case it's of interest, but as you can see, the checker trips on lots of anomalies (bad certificates, mysterious server errors) that don't prevent a browser from accessing the page.
linkcheck.zip

@fbennett
Member

fbennett commented Apr 29, 2016

The consistent errors from Wiley (500 Internal Server Error) are caused by their site configuration, which rejects HEAD requests. With curl and the -I option, you get the same result.
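
If that is the cause, a GET request that discards the body should sidestep it; roughly (a sketch only, not tested against Wiley specifically; $url stands for any documentation link):

# Fetch with GET instead of HEAD, throw the body away, print only the status code.
curl -s -o /dev/null -w '%{http_code}\n' --max-time 15 "$url"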

@zuphilip
Member Author

zuphilip commented Apr 29, 2016

Here are some (incomplete) statistics from Frank's results:

  • Valid : 247
    • 200 OK : 187
    • Valid: syntax OK : 59
  • Error : 657
    • 404 Not Found : 69
    • 403 Forbidden : 13
    • 500 Internal Server Error : 147
    • 301 (moved permanently) : 66
    • Error: SSLError : 102
    • Error: timeout : 158
    • Error: URLError : 140

I guess that we could work around some of these technical barriers with a different approach. E.g. for Wiley we could use full GET requests instead of HEAD requests, e.g. curl -v "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1529-8817/homepage/ForAuthors.html".

However, the question for me is, what can we do with the result?

Let me give you a specific example: in ambio.csl we found a 404 documentation link, http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2013.pdf . Would we then look for an updated documentation link at Springer? Yes, there is a newer version of the style requirements: http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2015.pdf?SGWID=0-0-45-960937-p173951212 . But what can we do then? We cannot simply replace the link, because the style requirements changed. However, I guess it is also infeasible to check this documentation closely and update all the CSL styles. Or am I missing something here?

@fbennett
Member

fbennett commented Apr 29, 2016

A style could also change significantly without a change to its URL - or the URL could change without any change to the style.

I'm not pushing legal tech solutions (really!), but these issues do bring to mind scotus-servo, a little tool built by David Zvenyach before he joined 18F. It has a narrow purpose, works only on PDFs, and I'm not sure how it does text-diffing, but to detect a change to a Supreme Court opinion it uses an MD5 checksum (or similar - it pushes the PDF into git and then reads back its blob hash).

Not to harp on perma.cc (really!), but if they provided a flag showing whether the doc at the live link differs from the doc at the time of archiving, it would solve half of these issues. They just received a large grant to expand their service, and might be open to suggestions for added functionality. Alternatively, a script could check for changes well enough, and with very little effort, by saving a checksum of each page.
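
For what it's worth, a rough sketch of that checksum approach (the checksums.txt file name and the curl/grep usage here are just assumptions for illustration, not an existing tool):

#!/bin/bash
# Sketch: record an MD5 checksum per documentation URL and flag changes on later runs.
store=checksums.txt
touch "$store"
grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u | while read -r url; do
  sum=$(curl -sL --max-time 15 "$url" | md5sum | cut -d' ' -f1)
  old=$(grep -F "$url " "$store" | cut -d' ' -f2)
  if [ -z "$old" ]; then
    echo "$url $sum" >> "$store"   # first run: just remember the checksum
  elif [ "$old" != "$sum" ]; then
    echo "CHANGED: $url"           # later runs: the page content has changed
  fi
done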

(Granted that this doesn't address Philip's concern about how to react to style and URL changes, though.)

@adam3smith
Member

If this helps us to identify changing styles, I think that's a bonus feature, not a bug. So if Ambio has changed, what should happen is that we create an issue for that (and eventually work through it and then replace the link accordingly).

@rmzelle
Member

rmzelle commented Apr 29, 2016

Regarding perma.cc, I'm not sure we could get an unlimited account. https://perma.cc/docs/faq#general says:

Anyone can sign up for a free Perma.cc account, which you can use to preserve up to 10 records per month. To preserve unlimited records, you have to be a member of an archiving organization sponsored by a registrar.

Anyway, I think that the primary function of the "documentation" URLs is to point to the relevant journal and/or style guide. If we can identify broken URLs and update them, we should, even if the style changed. It would still be an improvement.

@zuphilip
Member Author

Here are the 404 errors, if someone would like to start working on them: https://gist.github.com/zuphilip/58a4d391fc71d2530151eea6c8117fec , but maybe it is easier to start with the 301 errors. IMO it can be really time-consuming to go through some of the 404 cases...

Looking further ahead, we should think about a way to compare two versions of a style's requirements. I don't know much about perma.cc; saving a snapshot is good, but changing URLs is not what we are after. The Wayback Machine offers another way to save a copy of any page (if crawlers are allowed), and you don't have to register or anything. How about a webhook after merging/pulling commits which calls this service for each documentation URL, i.e.

http://web.archive.org/save/{url}

? Then we can be sure that, at any later point, we will still find in the Wayback Machine the documentation the style was built on.
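
A minimal sketch of what such a hook could run (it only uses the public save endpoint above; no authentication assumed, and rate limits may apply):

#!/bin/bash
# Sketch: ask the Wayback Machine to take a snapshot of every documentation link.
grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u | while read -r url; do
  curl -s -o /dev/null "http://web.archive.org/save/$url"
  echo "requested snapshot of $url"
done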

@stale

stale bot commented Dec 28, 2018

This issue hasn't seen any activity in the past 30 days. It will be automatically closed if no further activity occurs in the next two weeks.

@stale stale bot added the waiting The ticket/pull request is awaiting input from the contributor/depositor label Dec 28, 2018
@adam3smith adam3smith added repository quality-control and removed waiting The ticket/pull request is awaiting input from the contributor/depositor labels Dec 30, 2018