
Check Documentation Links #2042

Open
zuphilip opened this issue Apr 28, 2016 · 19 comments


@zuphilip
Member

A discussion about how to check documentation links started on twitter: https://twitter.com/adam42smith/status/725749988702179329 CC @adam3smith @rmzelle @inukshuk

I am quite sure there are a lot of different ideas and solutions, but maybe we should first do some requirements engineering. What do we actually want?

  • When do we want to check documentation links?
  • Which links do we want to check?
  • Which header codes (HTTP status codes) do we want to check (or maybe just avoid 404s)?
  • What should happen if a check fails?
@adam3smith
Member

When to check?

  • periodically, but I don't think via CI. I'd imagine the main problem here is linkrot, not newly added documentation links that 404. That's why I was thinking of a separately run script.

Which links?

  • all of them (same reason as above)

Which header code?

  • I liked the idea of checking for both 404s and redirects (and keeping them in separate categories); that should get us quite far.

What to do with failures?

  • I just want a list we can work through

@rmzelle
Member

rmzelle commented Apr 28, 2016

We also don't want 404s to result in failing Travis builds, since such failures might be temporary (servers that are down) or affect styles other than the ones changed in PRs. That would easily end up being very confusing for contributors.

@adam3smith
Member

Right -- in general we'd want CI errors and warnings to apply only to the current PR, which is why I was thinking a separate script might be better.

@fbennett
Member

Another possibility would be to use perma.cc:

https://perma.cc/

@adam3smith
Member

adam3smith commented Apr 28, 2016

since in this case changing links may mean changing styles, I don't think perma.cc is what we want (or am I misremembering what that does?)

@rmzelle
Member

rmzelle commented Apr 28, 2016

Yeah, permalinks don't seem very useful. They hide the destination in the style, and we'd have linkrot of permalink targets instead of our own links.

@zuphilip
Member Author

If a documentation link fails, then something else might also have happened, e.g. the publisher updated their style requirements. Thus, I guess we really want to capture these cases and then do something about them. But what do we want to do then? If we check the documentation links of the whole repo, we might end up with hundreds of failed links. Can we update them all manually? Just deleting them doesn't seem helpful either...

@adam3smith
Member

Yes, I think we want a list of all failures and to go through them gradually, starting with 404s. I don't see an alternative. After the first pass, if we do this twice a year, it should be pretty quick.

@fbennett
Member

A perma.cc page archives the page as written, a snapshot, and the original URL. It's an archival tool, so the saved page content doesn't reflect subsequent changes, but it does protect against complete loss of the style guide against which a style was prepared. It could be combined with a link-check script to detect gone-dead links.

@zuphilip
Member Author

Here is a first hack at a script (actually a one-liner) which you can run in bash:

$ grep -Poh '[^"]*(?=" rel="documentation")' *.csl | xargs curl -ILv

The output (only the first 100 URLs) is not pretty yet, and you have to extract the information from it somehow. You can search, for example, for "404 Not Found". Is this what we want?
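
For illustration, here is a slightly more structured sketch that prints one status code per URL (assuming GNU grep for -P and curl; the output format is just a suggestion):

#!/bin/bash
# Sketch: print "<status> <url>" for every documentation link in the styles.
# Run from the repository root; GET instead of HEAD, body discarded, no -L so
# permanent redirects (301) are reported as such.
grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u | while read -r url; do
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 15 "$url")
  echo "$status $url"
done | sort

The result could then simply be grepped for 404, 301, and so on.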

@fbennett
Member

If the script were to extract links and set them as anchors in an (ephemeral) local index page, you could run LinkChecker over the page to get a nicely formatted report on redirects and bad links.

https://wummel.github.io/linkchecker/
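
Something along these lines might do it (just a sketch, assuming GNU grep and the linkchecker command-line tool are installed; the temporary file name is arbitrary):

#!/bin/bash
# Sketch: build an ephemeral index page of all documentation links and run
# LinkChecker over it to get a formatted report on redirects and bad links.
{
  echo '<html><body>'
  grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u \
    | sed 's|.*|<a href="&">&</a><br/>|'
  echo '</body></html>'
} > /tmp/doc-links.html
linkchecker --check-extern /tmp/doc-links.html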

@fbennett
Member

On second thought, linkchecker might not be at all good for this. I cobbled some code together and ran a full report against the independent styles. I'll attach the output in case it's of interest, but as you can see, the checker trips on lots of anomalies (bad certificates, mysterious server errors) that don't prevent a browser from accessing the page.
linkcheck.zip

@fbennett
Member

fbennett commented Apr 29, 2016

The consistent errors from Wiley (500 Internal Server Error) are caused by their site configuration, which rejects HEAD requests. With curl and the -I option, you get the same result.
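
If that is the cause, a GET request that discards the body should sidestep it; roughly (a sketch only, not tested against Wiley specifically; $url stands for any documentation link):

# Fetch with GET instead of HEAD, throw the body away, print only the status code.
curl -s -o /dev/null -w '%{http_code}\n' --max-time 15 "$url"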

@zuphilip
Member Author

zuphilip commented Apr 29, 2016

Here are some (incomplete) statistics from Frank's results:

  • Valid : 247
    • 200 OK : 187
    • Valid: syntax OK : 59
  • Error : 657
    • 404 Not Found : 69
    • 403 Forbidden : 13
    • 500 Internal Server Error : 147
    • 301 (moved permanently) : 66
    • Error: SSLError : 102
    • Error: timeout : 158
    • Error: URLError : 140

I guess that we could work around some of these technical barriers with a different approach. E.g. for Wiley we could use full GET requests instead of HEAD requests, e.g. curl -v "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1529-8817/homepage/ForAuthors.html".

However, the question for me is, what can we do with the result?

Let me give you a specific example: in ambio.csl we found a 404 documentation link, http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2013.pdf . Would we then look for an updated documentation link at Springer? Yes, there is a newer version of the style requirements: http://www.springer.com/cda/content/document/cda_downloaddocument/Instructions_for_authors_AMBIO_2015.pdf?SGWID=0-0-45-960937-p173951212 . But what can we do then? We cannot simply replace the link, because the style requirements changed. However, I guess it is also infeasible to check this documentation closely and update all the CSL styles. Or am I missing something here?

@fbennett
Member

fbennett commented Apr 29, 2016

A style could also change significantly without a change to its URL - or the URL could change without any change to the style.

I'm not pushing legal tech solutions (really!), but these issues do bring to mind scotus-servo, a little tool built by David Zvenyach before he joined 18F. It has a narrow purpose, works only on PDFs, and I'm not sure how it does text-diffing, but to detect a change to a Supreme Court opinion it uses an MD5 checksum (or similar - it pushes the PDF into git and then reads back its blob hash).

Not to harp on perma.cc (really!), but if they provided a flag showing whether the doc at the live link differs from the doc at the time of archiving, it would solve half of these issues. They just received a large grant to expand their service, and might be open to suggestions for added functionality. Alternatively, a script could check for changes well enough, and with very little effort, by saving a checksum of each page.
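
For what it's worth, a rough sketch of that checksum approach (the checksums.txt file name and the curl/grep usage here are just assumptions for illustration, not an existing tool):

#!/bin/bash
# Sketch: record an MD5 checksum per documentation URL and flag changes on later runs.
store=checksums.txt
touch "$store"
grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u | while read -r url; do
  sum=$(curl -sL --max-time 15 "$url" | md5sum | cut -d' ' -f1)
  old=$(grep -F "$url " "$store" | cut -d' ' -f2)
  if [ -z "$old" ]; then
    echo "$url $sum" >> "$store"   # first run: just remember the checksum
  elif [ "$old" != "$sum" ]; then
    echo "CHANGED: $url"           # later runs: the page content has changed
  fi
done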

(Granted that this doesn't address Philip's concern about how to react to style and URL changes, though.)

@adam3smith
Member

If this helps us to identify changing styles, I think that's a bonus feature, not a bug. So if Ambio has changed, what should happen is that we create an issue for that (and eventually work through it and then replace the link accordingly).

@rmzelle
Member

rmzelle commented Apr 29, 2016

Regarding perma.cc, I'm not sure we could get an unlimited account. https://perma.cc/docs/faq#general says:

Anyone can sign up for a free Perma.cc account, which you can use to preserve up to 10 records per month. To preserve unlimited records, you have to be a member of an archiving organization sponsored by a registrar.

Anyway, I think that the primary function of the "documentation" URLs is to point to the relevant journal and/or style guide. If we can identify broken URLs and update them, we should, even if the style changed. It would still be an improvement.

@zuphilip
Member Author

Here are the 404 errors, if someone would like to start working on them: https://gist.github.com/zuphilip/58a4d391fc71d2530151eea6c8117fec , but maybe it is easier to start with the 301 errors. IMO it can be really time-consuming to go through some of the 404 cases...

Looking further ahead, we should think about a way to compare two versions of a style's requirements. I don't know much about perma.cc; saving a snapshot is good, but changing URLs is not what we are after. The Wayback Machine offers another way to save a copy of any page (if crawlers are allowed), and you don't have to register or anything. How about a webhook after merging/pulling commits which calls this service for each documentation URL, i.e.

http://web.archive.org/save/{url}

? Then we can be sure that, at any later point, we will still find in the Wayback Machine the documentation the style was built on.
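
A minimal sketch of what such a hook could run (it only uses the public save endpoint above; no authentication assumed, and rate limits may apply):

#!/bin/bash
# Sketch: ask the Wayback Machine to take a snapshot of every documentation link.
grep -Poh '[^"]*(?=" rel="documentation")' *.csl | sort -u | while read -r url; do
  curl -s -o /dev/null "http://web.archive.org/save/$url"
  echo "requested snapshot of $url"
done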

@stale

stale bot commented Dec 28, 2018

This issue hasn't seen any activity in the past 30 days. It will be automatically closed if no further activity occurs in the next two weeks.

@stale stale bot added the waiting The ticket/pull request is awaiting input from the contributor/depositor label Dec 28, 2018
@adam3smith adam3smith added repository quality-control and removed waiting The ticket/pull request is awaiting input from the contributor/depositor labels Dec 30, 2018