support link whitelisting #2

Closed
chalin opened this issue Dec 8, 2016 · 10 comments
@chalin
Collaborator

chalin commented Dec 8, 2016

One example of where this would be useful is running the checker over https://webdev.dart-lang.org. We do not yet have an Angular guide for the Router, but some Angular pages already link to the (soon to be created) Router page. It would be great if we could whitelist links to the Router page.

For example, the broken-link-checker has an excludedKeywords option. We use it like this under angular.io (note the value of the exclude array variable):

gulp.task('link-checker', () => {
  var method = 'get'; // the default 'head' fails for some sites
  var exclude = [
    // Dart API docs aren't working yet; ignore them
    '*/dart/latest/api/*',
    // Somehow the link checker sees ng1 {{...}} in the resource page; ignore it
    'resources/%7B%7Bresource.url%7D%7D',
    // API docs have links directly into GitHub repo sources; these can
    // quickly become invalid, so ignore them for now:
    '*/angular/tree/*',
    // harp.json "bios" for "Ryan Schmukler", URL isn't valid:
    'http://slingingcode.com'
  ];
  var blcOptions = { requestMethod: method, excludedKeywords: exclude};
  return linkChecker({ blcOptions: blcOptions });
});

cc @kwalrath @kevmoo

@filiph filiph self-assigned this Dec 8, 2016
@chalin
Collaborator Author

chalin commented Dec 8, 2016

Another example:

External link http://caniuse.com/#feat=shadowdom failed: http://caniuse.com/#feat=shadowdom exists, but the hash 'feat=shadowdom' does not

The link is valid, but the checker cannot make sense of this particular use of an anchor/fragment, so it is likely a good candidate for whitelisting.
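To make the failure mode concrete, here is a minimal sketch of how a link splits into a page URL and a fragment. It assumes (this is a guess from the error message, not something confirmed about linkcheck's internals) that the checker fetches the page and then looks for an anchor matching the fragment. The helper name `splitFragment` is hypothetical.

```javascript
// Hypothetical helper: separates the page URL from its fragment.
// Uses the standard WHATWG URL class available in Node.js and browsers.
function splitFragment(link) {
  const url = new URL(link);
  // url.hash includes the leading '#', or is '' when there is no fragment.
  const fragment = url.hash.startsWith('#') ? url.hash.slice(1) : '';
  url.hash = ''; // drop the fragment so only the page URL remains
  return { page: url.href, fragment };
}

const parts = splitFragment('http://caniuse.com/#feat=shadowdom');
console.log(parts.page);     // the page that is fetched
console.log(parts.fragment); // the part a checker would look for as an anchor
```

On a site that uses the fragment as client-side state rather than as an element id, no such anchor exists in the fetched HTML, which would explain the report even though the link works in a browser.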

@filiph
Owner

filiph commented Dec 8, 2016

Will you be invoking linkcheck from the command line (like in a shell script)? In that case, how would you prefer to give the excluded regexps? As a separate text file?

linkcheck :4001 -x exclude.txt

Does that seem reasonable? The other option is to provide it in line, but that makes the invocation ugly and brittle.

If this configuration-by-file is okay with you, how would you prefer the exclude.txt file to look? Regexp per line, no comments? YAML? For example, have you ever wanted to have more structure in the exclude = [ ... ] option? Can you imagine needing something more than lines?

Also, does it need to be RegExp or should we use glob to make the writing of that file a bit easier?

Last but not least, should this feature be called whitelist or exclude or something else? Whitelist seems confusing to me, but exclude can be confusing too, I guess.

@chalin
Collaborator Author

chalin commented Dec 8, 2016

This is an example of linkcheck output that actually shows an error, despite the link being valid:

- http://localhost:4001/tools/dart2js
  *  External link https://developer.apple.com/library/safari/documentation/AppleApplications/Conceptual/Safari_Developer_Guide/Debugger/Debugger.html#//apple_ref/doc/uid/TP40007874-CH5-SW1 failed: response code 0 means something's wrong.
             It's possible libcurl couldn't connect to the server or perhaps the request timed out.
             Sometimes, making too many requests at once also breaks things.
             Either way, the return message (if any) from the server is: SSL connect error

@filiph
Owner

filiph commented Dec 8, 2016

I'm confused. Is this output from linkcheck? Or is it just an example of something you'd like to exclude?

@chalin
Collaborator Author

chalin commented Dec 8, 2016

This is output from linkcheck (I updated the comment to clarify that).

@chalin
Collaborator Author

chalin commented Dec 8, 2016

You ask valid questions. Here are some initial thoughts:

  • Exclusion file or command line option: an exclusion file is good.
  • File format? I've seen a file format where # starts a line comment and otherwise there is a pattern per line.
  • Regex or glob. Both have advantages, though with the broken-link-checker (which supports only globs), I've sometimes missed being able to use regexes. If you are willing to support both then the line format could be: [glob|regex] pattern.
  • Name for exclusion list: I agree that whitelist and exclude can be confusing. Ignore and skip could be valid alternatives as well. E.g. --skip-patterns <file> or --skip-links <pattern-file>.
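To illustrate the file format suggested in the list above, here is a rough sketch of a parser for it. It only illustrates the proposal (`#` line comments, otherwise one `glob <pattern>` or `regex <pattern>` entry per line) and is not a description of what linkcheck implements. The function names `globToRegExp`, `parseSkipFile`, and `shouldSkip` are hypothetical, and the glob support is deliberately limited to `*`.

```javascript
// Hypothetical parser for the proposed pattern file. Not linkcheck's code.

// Convert a simple glob (only `*` is special) into an anchored RegExp.
function globToRegExp(glob) {
  // Escape every RegExp metacharacter except `*`.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// Parse the file contents into a list of RegExp objects.
function parseSkipFile(text) {
  const patterns = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // blank or comment
    const space = line.search(/\s/);
    const kind = space === -1 ? '' : line.slice(0, space);
    const pattern = space === -1 ? '' : line.slice(space).trim();
    if (kind === 'glob') {
      patterns.push(globToRegExp(pattern));
    } else if (kind === 'regex') {
      patterns.push(new RegExp(pattern));
    } else {
      throw new Error('Expected "glob <pattern>" or "regex <pattern>": ' + line);
    }
  }
  return patterns;
}

// A link is skipped when any pattern matches it.
function shouldSkip(url, patterns) {
  return patterns.some((re) => re.test(url));
}
```

With a file containing `glob */angular/tree/*` and `regex ^http://caniuse\.com/#feat=`, both the GitHub source links and the caniuse link from the earlier comments would be skipped, while ordinary site links would still be checked.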

@filiph
Owner

filiph commented Dec 15, 2016

I will assume you want to (A) exclude the links as they are stated in href. The other approach (B) would be to exclude links by their final URL (after redirects). That would mean trying all links by default, just in case they end up being redirected to a non-skipped URL.

I'm implementing (A). Stop me if you'd prefer (B).
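The difference between the two approaches can be sketched as follows. This is only an illustration under stated assumptions: the redirect table stands in for real HTTP requests, the example.com URLs are made up, and the function names are hypothetical rather than taken from linkcheck.

```javascript
// Hypothetical illustration of (A) versus (B). Not linkcheck's code.

// Stand-in for the network: maps a URL to the URL it redirects to.
const redirects = new Map([
  ['http://example.com/old', 'http://example.com/router'],
]);

// Follow redirects to the final URL, counting one simulated request per link.
function resolveFinalUrl(href, counter) {
  counter.requests += 1;
  let url = href;
  const seen = new Set();
  while (redirects.has(url) && !seen.has(url)) {
    seen.add(url);
    url = redirects.get(url);
  }
  return url;
}

// (A) Match the href as written; a skipped link costs no request.
function checkByHref(href, skip, counter) {
  if (skip.some((re) => re.test(href))) return 'skipped';
  resolveFinalUrl(href, counter);
  return 'checked';
}

// (B) Match the final URL; every link needs a request before deciding.
function checkByFinalUrl(href, skip, counter) {
  const finalUrl = resolveFinalUrl(href, counter);
  return skip.some((re) => re.test(finalUrl)) ? 'skipped' : 'checked';
}
```

Under (A), a link written as `http://example.com/old` is still checked even though it redirects to a skipped page. Under (B) it would be skipped, at the price of one request for every link, including those that end up being skipped.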

@filiph
Owner

filiph commented Dec 16, 2016

Ok done, please see this section of the readme. Let me know whether this works for you.

@filiph filiph closed this as completed Dec 16, 2016
@filiph
Owner

filiph commented Dec 16, 2016

I should add: run pub global activate linkcheck to get the newest version.

@chalin
Collaborator Author

chalin commented Dec 16, 2016

Very nice! It seems to be working like a charm!
