
Enabling SSL Support #1

Open

tedsecretsource opened this issue Mar 5, 2020 · 11 comments

@tedsecretsource

First of all, I am thrilled to see someone picking up the ball on this project. I love(d) linklint and was really disappointed when I realized I couldn't use it reliably with HTTPS web sites, and thus, my query: How can I use this version of linklint to check links on an HTTPS web site?

TIA and thanks a lot for giving this a go!

@bfmartin
Owner

bfmartin commented Mar 5, 2020

Hi. I'm glad you appreciate this package as well.

This version can already connect to HTTPS. There's an additional command-line option, "-https", to pass; otherwise it works the same as for HTTP.

Have you tried this? If it doesn't work for you, please let me know.

Thank you.

@tedsecretsource
Author

Hi,

That seemed to work, but it's only checking the main page (and not really even parsing it, as far as I can tell). I suspect there may be a robot restriction in place. I seem to recall being able to change the User-Agent, but the -http_header option is not recognized (although it appears in the documentation).

I'd be happy to submit PRs but my Perl skills are virtually non-existent. I'm posting a couple of files of output to see if you can spot the problem.

Given this command file:

-db7
-host www.chicagoitsystems.com
-doc chicagoitsystems
-https
-limit 1000
-htmlonly
/@

I get the following result

file: index.txt
host: www.chicagoitsystems.com
date: Thu, 05 Mar 2020 23:07:28 (local)
Linklint version: 3.0.2

 summary.txt: summary of results
     log.txt: log of progress
   ignore.txt: -----   1 ignored file
  ignoreX.txt: -----   1 ignored file (cross referenced)
     warn.txt: warn    1 warning
    warnX.txt: warn    1 warning (cross referenced)
    warnF.txt: warn    1 file with warnings
 httpfail.txt: -----   1 link: failed via http

warn.txt contains:

file: warn.txt
host: www.chicagoitsystems.com
date: Thu, 05 Mar 2020 23:07:28 (local)
Linklint version: 3.0.2

#------------------------------------------------------------
# warn    1 warning
#------------------------------------------------------------
no status. Will try GET method

And httpfail.txt contains:

file: httpfail.txt
host: www.chicagoitsystems.com
date: Thu, 05 Mar 2020 23:07:28 (local)
Linklint version: 3.0.2

#------------------------------------------------------------
# -----   1 link: failed via http
#------------------------------------------------------------
1 url: no status. Will try GET method
    /

Any ideas what I'm doing wrong?

Thanks in advance!

@bfmartin
Owner

bfmartin commented Mar 6, 2020

Checking. Your site behaves differently than mine.

@tedsecretsource
Author

In what regard? I ask because we have about 150 sites, all configured the same way on the same server. We've always understood our server configuration to be pretty standard. If linklint can't handle a server with a standard configuration, maybe it could somehow be updated? Do you have any test servers that are known to work that you can share?

@bfmartin
Owner

bfmartin commented Mar 7, 2020

In the regard that linklint works for my sites, but not for yours.

For example, this works (my site):
linklint -host www.bfmartin.ca /@ -https

This does not:
linklint -host www.chicagoitsystems.com /@ -https

I note that the wget command works for both domains.

wget https://www.bfmartin.ca/
wget https://www.chicagoitsystems.com/

However, the HEAD command (also in Perl) does not work for www.chicagoitsystems.com; it returns a 403 Forbidden error. That's possibly a result of using Cloudflare.

HEAD https://www.bfmartin.ca/
HEAD https://www.chicagoitsystems.com/

Can you test linklint against your website without going through Cloudflare? Is it different?
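The behavior hinted at by linklint's "no status. Will try GET method" warning — try a cheap HEAD first, fall back to a full GET when it fails — can be sketched as below. This is an illustration of the pattern under discussion, not linklint's actual code; the `check_url` and `stub` names, and the stand-in `fetch` callback, are invented for the example.

```python
def check_url(url, fetch):
    """Try a HEAD request first; on failure fall back to GET.

    `fetch(method, url)` is a stand-in for the real HTTP client and
    returns a numeric status code, or None when no status was read.
    """
    status = fetch("HEAD", url)
    if status is None or status >= 400:
        status = fetch("GET", url)
    return status

# A stub that behaves like the server discussed above: it rejects
# HEAD with 403 Forbidden but answers GET normally.
def stub(method, url):
    return 403 if method == "HEAD" else 200

print(check_url("https://www.chicagoitsystems.com/", stub))  # 200
```

With a fallback like this, a HEAD-blocking firewall would slow the check down but not break it outright.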

@tedsecretsource
Author

HEAD returns 403 Forbidden.

That's very odd, but I'm going to guess it's the server (SiteGround) firewall. I can't see the return code in the output files (an enhancement I'd love to add, by the way), but with Cloudflare disabled I'm not seeing any difference.

FWIW, curl -I https://www.chicagoitsystems.com/ returns this regardless of whether or not Cloudflare is enabled:

curl -I https://www.chicagoitsystems.com/
HTTP/2 200 
date: Sun, 08 Mar 2020 14:24:16 GMT
content-type: text/html; charset=UTF-8
set-cookie: __cfduid=dd591971622f69b1cedfe9e97382b2b321583677456; expires=Tue, 07-Apr-20 14:24:16 GMT; path=/; domain=.chicagoitsystems.com; HttpOnly; SameSite=Lax
cache-control: max-age=864000
cf-railgun: 22c13071f0 stream 0.000000 0200 206c
expires: Wed, 18 Mar 2020 14:24:16 GMT
link: <https://www.chicagoitsystems.com/wp-json/>; rel="https://api.w.org/", <https://www.chicagoitsystems.com/>; rel=shortlink
vary: User-Agent
x-pingback: https://www.chicagoitsystems.com/xmlrpc.php
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 570d38052a60fefc-MAD

Could you send me the exact Perl command you are running, with instructions on how to execute it if needed? My server is running HTTP/2. Is it possible that linklint is parsing the response status line and, since it doesn't find HTTP/1.x, it gives up? I am seeing this in warn.txt: "no status. Will try GET method" (although it's odd that I don't then see a result from the GET method either, as far as I can discern).
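The hypothesis above can be sketched: if the client only recognizes HTTP/1.x status lines, an HTTP/2-style response line produces no status at all. The sketch below is Python, not linklint's Perl, and the regex is an assumption about what the suspected parsing code might look like.

```python
import re

# Hypothetical parser in the style suspected above: it only accepts
# HTTP/1.x status lines, so "HTTP/2 200" yields no status at all.
STATUS_RE = re.compile(r"HTTP/1\.\d (\d{3})")

def parse_status(line):
    """Return the numeric status code, or None if the line is unrecognized."""
    m = STATUS_RE.match(line)
    return int(m.group(1)) if m else None

print(parse_status("HTTP/1.1 200 OK"))  # 200
print(parse_status("HTTP/2 200"))       # None: "no status. Will try GET method"
```

If this is the failure mode, loosening the match to `HTTP/\d(\.\d)?` — or delegating parsing to a standard HTTP client — would fix it.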

Also, I'm pretty sure that part of the issue is the User-Agent. IIRC, linklint used to allow you to change the User-Agent but I'm unable to figure out how. Any ideas there?

Thanks for helping figure this out. Hopefully whatever resolution we find will help others.

@tedsecretsource
Author

Actually, I now see how LinkLint generates the User-Agent:

$UserAgent = "LinkLint-$agent";

As this could be open to abuse, I'll just change my local version of the source code to pass the User-Agent of an actual browser. I know for a fact that on another server we manage, linklint is being blocked simply because of its User-Agent.
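Sending a browser-style User-Agent instead of the default "LinkLint-…" string can be illustrated as below (Python's urllib here for demonstration; the actual change to linklint is the one-line Perl edit above). The particular browser string is an arbitrary example.

```python
import urllib.request

# A browser-style User-Agent, substituted for the "LinkLint-<version>"
# default that some servers block outright. The exact value is arbitrary.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

req = urllib.request.Request("https://www.example.com/",
                             headers={"User-Agent": BROWSER_UA})
print(req.get_header("User-agent"))  # the browser string above
```

The request object is only constructed here, not sent, so the example runs without touching the network.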

@tedsecretsource
Author

I've been doing some testing. I'm beginning to believe it is an issue with sites running on nginx vs. Apache. For example, the following sites, believed to be running Apache, seem to work:

And the following sites, all using nginx, seem to fail with a warning (and from time to time an error):

My testing has not been exhaustive, and there is no obvious reason why the type of server should be the culprit if they are all speaking the same protocol, but it does seem to be a deciding factor. If I have some more time later, I'll try setting up a local nginx server and examine the headers more closely to see if I can spot what's failing.

Thanks again for listening :-)

@bfmartin
Owner

bfmartin commented Mar 8, 2020

Wow, OK. I hadn't run into an issue with the type of server before. Maybe it is, as you say, related to the HTTP protocol version number.

I notice that linklint does not use a standard Perl module for issuing requests; instead it rolls its own. That code should be replaced with a module.

Error handling in linklint is also poor; it needs to show better messages when it fails. Adding a standard HTTP module would likely help with error handling as well.

It's on my todo list.

Is there anything else you were looking for?

@tedsecretsource
Author

I think that the response code is an important thing to include in any report for all URLs (if it's not included already). For now, that's pretty much it. Linklint has always produced just the right amount of data for my needs.
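The enhancement suggested above — including the response code for every URL in the report — could look something like this. The `format_report` function and the report layout are invented for illustration; linklint's actual report writer is not shown in this thread.

```python
def format_report(results):
    """Render URL -> status-code pairs as linklint-style report lines.

    Status None (no status could be read, as in the warning above)
    is shown as "???" so the failure is still visible in the report.
    """
    lines = []
    for url, status in sorted(results.items()):
        code = str(status) if status is not None else "???"
        lines.append(f"{code:>4}  {url}")
    return "\n".join(lines)

print(format_report({"/": 200, "/missing": 404, "/blocked": None}))
```

With a standard HTTP module in place, the status code is available on every response object, so threading it through to the report is straightforward.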

As you stated, linklint needs a standard HTTP Request module and I bet that will resolve a majority of issues. This one looks like a likely candidate for use: https://metacpan.org/pod/HTTP::Request

If I get some time, I'll try and code it up myself and submit a PR (but don't hold your breath 😆).

@bfmartin
Owner

bfmartin commented Mar 8, 2020

Thank you for reporting this issue, and helping me work through it.
