A much more detailed, granular HTTP mode #24

Merged
merged 75 commits on Apr 21, 2015

Conversation

2 participants
@konklone
Collaborator

konklone commented Apr 13, 2015

OK, the work in this branch does not have sufficient tests. I'm opening it here so I can get a sense from @benbalter whether or not this is going in a direction that's going to be accepted into site-inspector.

The goal of this work is to provide a more detailed and accurate glimpse into the behavior of domains at all four standard combinations of root and protocol:

  • https://example.com
  • http://example.com
  • https://www.example.com
  • http://www.example.com

It is downstream from, and replaces, the --http flag work I did in #22, so the --http flag now triggers a very different response object. It's not backwards-compatible, and it's more flexible about hostname verification and trust validation for TLS.

This comes out of a few needs I have:

  • Domains are not always rational about how redirects between the main 4 combos work, and so site-inspector's idea of a "canonical" domain can miss important details or make incorrect claims about the domain.
  • When measuring HSTS, what matters are the headers and TLS validation on the root domain, not the canonical domain.
  • It's relevant, and in some cases crucial, to detect when the site might be live but have an invalid certificate chain. It's not as relevant, but still helpful, to capture when a site has an HTTPS certificate for an invalid hostname (e.g. an akamai.net hostname).

If this work is too much of a shift for this library, I'll understand, and can probably go repurpose this into a standalone thing. But it seems pretty relevant to the library's goal, and after spending some hours on it, I didn't know how to elegantly merge the original mode and this mode.

Here's an example:

$ ./bin/site-inspector nasa.gov --http
{
  "canonical":"http://www.nasa.gov",
  "canonical_protocol":"https",
  "canonical_endpoint":"www",
  "endpoints":{
    "https":{
      "www":{
        "https_valid":true,
        "status":200,
        "hsts":false,
        "hsts_header":null,
        "redirect":false,
        "redirect_to_external":false,
        "redirect_to_https":false,
        "live":true
      },
      "root":{
        "status":0
      }
    },
    "http":{
      "www":{
        "status":200,
        "hsts":false,
        "hsts_header":null,
        "redirect":false,
        "redirect_to_external":false,
        "redirect_to_https":false,
        "live":true
      },
      "root":{
        "status":0
      }
    }
  },
  "live":true,
  "broken_root":true,
  "broken_www":false,
  "enforce_https":false,
  "redirect":false,
  "hsts":false,
  "hsts_entire_domain":null,
  "hsts_entire_domain_preload":null
}

And another:

$ ./bin/site-inspector whitehouse.gov --http
{
  "canonical":"https://www.whitehouse.gov",
  "canonical_protocol":"https",
  "canonical_endpoint":"www",
  "endpoints":{
    "https":{
      "www":{
        "https_valid":true,
        "status":200,
        "hsts":true,
        "hsts_header":"max-age=3600;include_subdomains",
        "redirect":false,
        "redirect_to_external":false,
        "redirect_to_https":false,
        "live":true
      },
      "root":{
        "https_valid":true,
        "status":301,
        "hsts":false,
        "hsts_header":null,
        "redirect":true,
        "redirect_to":"https://www.whitehouse.gov/",
        "redirect_to_external":true,
        "redirect_to_https":true,
        "live":true
      }
    },
    "http":{
      "www":{
        "status":302,
        "hsts":false,
        "hsts_header":null,
        "redirect":true,
        "redirect_to":"https://www.whitehouse.gov/",
        "redirect_to_external":true,
        "redirect_to_https":true,
        "live":true
      },
      "root":{
        "status":301,
        "hsts":false,
        "hsts_header":null,
        "redirect":true,
        "redirect_to":"https://www.whitehouse.gov/",
        "redirect_to_external":true,
        "redirect_to_https":true,
        "live":true
      }
    }
  },
  "live":true,
  "broken_root":false,
  "broken_www":false,
  "enforce_https":true,
  "redirect":false,
  "hsts":true,
  "hsts_entire_domain":false,
  "hsts_entire_domain_preload":false
}

I'll annotate the rest inline -- looking for feedback here, and suggestions about how this could be better integrated into the library.

TODOs

  • Add a test for the new behavior that a canonical URL is only HTTPS if the domain enforces HTTPS for that canonical URL.
  • Uncomment the line adding per-endpoint header information to the response.
  • A domain should still be considered a redirector even if it only listens over plain HTTP, or plain HTTPS.

konklone added some commits Apr 13, 2015

Merge branch 'whitehouse' into light
Conflicts:
	test/fixtures/vcr_cassettes/18f_gsa_gov.yml
	test/fixtures/vcr_cassettes/cfpb_gov.yml
	test/fixtures/vcr_cassettes/cio_gov.yml
	test/fixtures/vcr_cassettes/ed_gov.yml
	test/fixtures/vcr_cassettes/github_com.yml
	test/fixtures/vcr_cassettes/nasa_gov.yml
	test/fixtures/vcr_cassettes/whitehouse_gov.yml
	test/fixtures/vcr_cassettes/www_cio_gov.yml
	test/fixtures/vcr_cassettes/www_google_co_uk.yml
Merge branch 'http' into light
Conflicts:
	bin/site-inspector
	lib/site-inspector.rb
@konklone
Collaborator

konklone commented Apr 13, 2015

I should note, this is also downstream of #23, so if/once #21 and #22 are merged, this PR should still merge cleanly.

require "oj"
domain = ARGV[0]
http_mode = (ARGV[1] == "--http")

@konklone
konklone Apr 13, 2015
Collaborator

I looked long and hard for a good, simple Ruby command line option parser, and eventually figured this was easier. It could be that this mode is so different that it becomes its own binary or something.

@benbalter
benbalter Apr 20, 2015
Owner

For small scripts like this, ARGV is the best route IMHO. There's nothing great out there, short of something like Mercenary, which is a bit heavyweight.

puts Oj.dump(details, indent: 2, :mode => :compat)
puts Oj.dump(details, indent: 2, mode: :compat)

@konklone
konklone Apr 13, 2015
Collaborator

I already noticed some Ruby 1.8 incompatibility, so I assume changes like this are acceptable.

@benbalter
benbalter Apr 20, 2015
Owner

👍

domain = domain.downcase
domain = domain.sub /^https?\:/, ""
domain = domain.sub /^\/+/, ""
domain = domain.sub /^www\./, ""
@uri = Addressable::URI.parse "//#{domain}"
@domain = PublicSuffix.parse @uri.host
@timeout = options[:timeout] || 10

@konklone
konklone Apr 13, 2015
Collaborator

Timeout is configurable, but retains its previous default.

uri = @uri.clone
uri.host = "www.#{uri.host}" if www
uri.host = www ? "www.#{uri.host}" : uri.host

@konklone
konklone Apr 13, 2015
Collaborator

Aesthetic change to this syntax, to parallel the line below it about ssl.

end
def inspect
"<SiteInspector domain=\"#{domain}\">"
end
def uri(ssl=https?,www=www?)
def uri(ssl=enforce_https?,www=www?)

@konklone
konklone Apr 13, 2015
Collaborator

This is a change to both modes: the "canonical" URI will now only end up as https:// if the domain forces HTTPS. For example, the canonical URL for nasa.gov was https://www.nasa.gov -- however, I'd argue that the canonical URL is http://www.nasa.gov, as that's what Google brings you to and how NASA links to stuff.

So now, the canonical URL will be https:// only if the domain enforces HTTPS. I still need to add a test for this behavior, and if you'd rather I spin it off from here, I can.

@benbalter
benbalter Apr 20, 2015
Owner

👍 Makes sense.

@@ -49,8 +51,13 @@ def domain
www? ? PublicSuffix.parse("www.#{@uri.host}") : @domain
end
def request(ssl=false, www=false, followlocation=true)
Typhoeus.get(uri(ssl, www), followlocation: followlocation, timeout: 10)
def request(ssl=false, www=false, followlocation=true, ssl_verifypeer=true, ssl_verifyhost=true)

@konklone
konklone Apr 13, 2015
Collaborator

Adds two additional options to request to disable trust validation and hostname verification, respectively. They have no effect if passed in to an http:// request.

@benbalter
benbalter Apr 20, 2015
Owner

This is getting ugly. Can we make this an options hash with defaults (which is a bit more Ruby)?
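A rough sketch of that shape, purely illustrative (the option names just mirror the current positional arguments, nothing here is the PR's actual code):

# Sketch only: drive request with an options hash that has defaults.
def request(options = {})
  options = {
    ssl: false,
    www: false,
    followlocation: true,
    ssl_verifypeer: true,
    ssl_verifyhost: true
  }.merge(options)

  Typhoeus.get(
    uri(options[:ssl], options[:www]),
    followlocation: options[:followlocation],
    ssl_verifypeer: options[:ssl_verifypeer],
    ssl_verifyhost: (options[:ssl_verifyhost] ? 2 : 0),
    timeout: @timeout
  )
end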

# If it's a redirect, go find the ultimate response starting from this combo.
redirect_code = response.response_code.to_s.start_with?("3")
location_header = headers["location"]
if redirect_code and location_header

@konklone
konklone Apr 13, 2015
Collaborator

It's only considered a redirect if it's a 30X response code and the Location header is present (mirroring normal mode's logic). The Location header can indicate other things for 20X (like the URL of a created resource), and some 30X response codes don't indicate redirects.

:non_www => non_www?,
:redirect => redirect,
:headers => headers
}

@konklone
konklone Apr 13, 2015
Collaborator

The above "http_only" mode from #22 is still here, but now inaccessible. It could be removed.

@benbalter
benbalter Apr 20, 2015
Owner

but now inaccessible.

If it's not used, it should be removed.

konklone added some commits Apr 13, 2015

@konklone
Collaborator

konklone commented Apr 13, 2015

I've updated this to include the caching work in #25.

konklone added some commits Apr 13, 2015

@konklone konklone referenced this pull request Apr 20, 2015

Merged

A more serious HSTS parser #29

1 of 1 task complete
@konklone
Collaborator

konklone commented Apr 20, 2015

OK, @benbalter, I think this PR is ready for review!

There's a reasonably comprehensive set of (network-touching) tests that evaluate the endpoint analysis and domain-wide conclusions from --http mode (which uses the #http method instead of #to_hash).

The behavior of running site-inspector domain.com without --http is untouched, and the original test suite is intact (VCR and all). So this is fully backwards-compatible.

I've used the federal .gov domain list (which includes legislative and judicial, ~1,350 domains) as my test dataset, and I feel pretty confident in the results.

Opt-in as --http mode currently is, this still has a lot of stuff in it. I'll do my best to go over what I think the major differences are.

Holistically: the old mode first looks at whether a domain canonically appears to use www or not, then looks at whether that endpoint should use https or not. This means only 3 of the 4 possible combinations are ever examined, and many conclusions are drawn only from the eventually determined "canonical" endpoint.

The new mode does try to figure out a "canonical" endpoint for the domain, but pings all 4 and reviews all of them whenever it makes sense. Some conclusions (like whether HSTS applies to the given domain, or where the domain redirects to) are still drawn from the canonical endpoint. The process for figuring out the canonical endpoint is very different, and analyzes how the endpoints appear to redirect to each other.

The new method of HTTP analysis

The new --http mode:

  • classifies about ~100 more of the .govs as "up" than the old mode did as "live".

The definition here is different. The old mode decided a domain is "live" based on whether the response eventually led somewhere with a 200 response code, following redirects if need be. The new definition of "up" is much looser: does the domain respond to HTTP at all, on any endpoint? The only way a domain isn't "up" is if none of its 4 endpoints responds to HTTP.
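(As a rough sketch, with the method and hash key names assumed rather than taken from the diff:)

# Sketch: a domain is "up" if any of its 4 endpoints responds at all.
def up?(endpoints)
  endpoints.values.any? { |protocol| protocol.values.any? { |endpoint| endpoint[:up] } }
end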

  • classifies ~25 more domains as redirect domains than the old mode
  • classifies ~25 fewer domains as redirect domains than the old mode

The new mode looks at all 4 endpoints and at whether or not the domain eventually redirects to an external domain (similar to the old mode). However, unlike the old mode, all 4 endpoints need to either be external redirects, be down, or (in the case of the HTTPS endpoints) have a misconfigured cert.

  • classifies ~30 more domains as having valid HTTPS than the old mode

The old mode decided the domain "has HTTPS" by looking at what it decided was the canonical domain, and rejected all cert configuration errors. The new --http mode decides the domain "supports HTTPS" by looking at both HTTPS endpoints to see if either is up and configured. It also allows for certificate chain issues (such as missing intermediates, which many browsers compensate for as described in #21), while not allowing certificate name issues (e.g. a248.e.akamai.net).

  • classifies ~40 more domains as "defaulting" to HTTPS than the old mode did as "forcing" HTTPS
  • classifies ~15 fewer domains as "defaulting" to HTTPS than the old mode did as "forcing" HTTPS

The old mode decided a domain "forced HTTPS" by looking at the canonical endpoint and seeing if it redirected. The new mode decides that a domain "defaults to HTTPS" simply by looking at whether the canonical endpoint was ascertained to use HTTPS. This canonical detection is done by looking at all 4 endpoints and making sure that one of the 2 HTTPS endpoints is up, that the HTTP endpoints are either down or redirect somewhere, and that if one of the HTTP endpoints redirects, its initial redirect is to an HTTPS URL within the domain.

  • classifies ~60 more domains as "strictly forcing" HTTPS than the old mode did as "forcing" HTTPS
  • classifies ~27 fewer domains as "strictly forcing" HTTPS than the old mode did as "forcing" HTTPS

The old mode decided a domain "forced HTTPS" by looking at the canonical endpoint and seeing if it redirected. The new mode takes a very different approach that is in most ways stricter and in some ways looser. A domain is considered to force HTTPS if both of its HTTP endpoints are either down or redirect immediately to an https:// URL (whether on an external domain or not), and if at least one of its HTTPS endpoints is up.

This is an imperfect definition, and allows a redirector domain to be classified as "forcing HTTPS" if its HTTP endpoints redirect to an external HTTPS URL and if its HTTPS endpoints redirect to an HTTP URL!

But I think what was previously considered "forcing HTTPS" is best thought of as "defaulting to HTTPS". What I want to measure with "forcing HTTPS" is "if someone visits this domain over HTTP, does the domain immediately correct the user into HTTPS-land?" I'm not sure that forcing a redirector domain to first correct to HTTPS for itself before it redirects to another HTTPS domain is necessary to meet this definition.

A redirector domain whose HTTPS endpoints redirect to external HTTP shouldn't be "punished" for being a redirector for an insecure domain (potentially not in their control). The domain's done its job by offering HTTPS and making sure its HTTP endpoints immediately move the user into HTTPS-land.

The definition requires the HTTPS endpoints to be up because for a site to truly force HTTPS, it also needs HSTS set and ideally preloaded, which requires HTTPS to be available so that http:// URLs can be freely rewritten to https:// by user agents.
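(Roughly, as an illustrative sketch rather than the PR's actual code -- `endpoints` is assumed to be the nested hash shown in the JSON output above:)

# Sketch of the "strictly forces HTTPS" rule described above.
def enforce_https?(endpoints)
  https_up = endpoints[:https][:www][:up] || endpoints[:https][:root][:up]

  http_ok = [:www, :root].all? do |host|
    endpoint = endpoints[:http][host]
    !endpoint[:up] ||
      (endpoint[:redirect] && endpoint[:redirect_immediately_to].to_s.start_with?("https://"))
  end

  https_up && http_ok
end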

  • classifies 1 more domain as having HSTS than the old mode did
  • classifies 18 fewer domains as having HSTS than the old mode did

The old HSTS detection simply looked at whether the Strict-Transport-Security header is present. As I detailed in #29, this was very insufficient. It also looked at the Strict-Transport-Security header of the domain it eventually redirected to, even an external one. The new mode analyzes only the header on the canonical domain.

  • tracks whether a domain "downgrades HTTPS"

If a domain supports HTTPS, but redirects the HTTPS endpoint at its canonical domain to HTTP, then the site is said to downgrade the user to HTTP.
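(Illustrative sketch of that check; the method names and the canonical_host parameter are assumed, and support_https? is taken as a separate check:)

# Sketch: a domain "downgrades" if it supports HTTPS, but the HTTPS endpoint
# at its canonical host immediately redirects to an http:// URL.
def downgrade_https?(endpoints, canonical_host)
  endpoint = endpoints[:https][canonical_host]
  support_https? &&
    endpoint[:redirect] &&
    endpoint[:redirect_immediately_to].to_s.start_with?("http://")
end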

  • tracks whether a domain's HTTPS endpoints have a cert for an invalid hostname

For HTTPS endpoints, if the initial request doesn't validate HTTPS based on hostname, it's run again while disabling hostname checking.

This value is then used as essentially synonymous with an HTTPS endpoint being "down", since an invalid hostname generally points to an unsupported method for accessing the domain.

  • tracks whether a domain's HTTPS endpoints have a cert with an invalid certificate chain

For HTTPS endpoints, if the initial request doesn't validate HTTPS based on the certificate chain, it's run again while disabling trust verification.

This condition is often allowed as a synonym for "up" in domain analysis, as it generally points to a supported method for accessing the domain, where a domain owner is either depending on visitors to have a custom root installed, or (more commonly) didn't include the intermediate certificates in their chain. In the latter case, many desktop and mobile browsers will, some of the time (not always deterministically!), validate the connection based on AIA fetching as detailed in #21, or by using cached intermediate certs seen on prior connections (hence the indeterministic result).

  • evaluates whether HSTS is enabled domain-wide and whether it's preload-ready

The HSTS header is analyzed to see whether the includeSubDomains directive is present. The root endpoint for the domain is evaluated, whether or not the www endpoint is canonical, as putting includeSubDomains on the www subdomain (a common mistake) will only set HSTS policy for www and subdomains of www.

The HSTS header is also analyzed for whether the preload directive is present, also on the root endpoint for the domain. In addition, Chrome's standards for HSTS preloading -- that includeSubDomains be present, and that the max-age be at least eighteen weeks -- are also evaluated.

A semi-sophisticated HSTS parser was developed in #29 to validate these results.
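(For reference, a minimal sketch of what that parsing involves -- not the actual parser from #29:)

# Minimal sketch of HSTS header parsing. Chrome's preload bar: includeSubDomains
# present and a max-age of at least 18 weeks (10886400 seconds).
def hsts_parse(header)
  directives = header.to_s.downcase.split(";").map(&:strip)

  max_age_directive = directives.find { |d| d.start_with?("max-age=") }
  max_age = max_age_directive ? max_age_directive.split("=").last.to_i : nil

  include_subdomains = directives.include?("includesubdomains")
  preload = directives.include?("preload")

  {
    max_age: max_age,
    include_subdomains: include_subdomains,
    preload: preload,
    enabled: !max_age.nil? && max_age > 0,
    preload_ready: include_subdomains && preload && !max_age.nil? && max_age >= 10886400
  }
end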

Other things that happened

  • The method for calculating whether a domain's canonical endpoint was https:// in both the old mode and --http mode was changed to look at whether the domain defaults to HTTPS or not. For example, this changes the canonical domain for nasa.gov from https://www.nasa.gov to http://www.nasa.gov, which is what they use.
  • An opt-in disk cache was added (now in master) to make for far faster development and testing of batch requests. Cache can be enabled with CACHE=[directory] as an environment variable, and its cached value can be ignored, deleted, and replaced by including CACHE_REPLACE=1 as well.
  • The site-inspector binary was sped way up by moving the require calls for third party libraries not used in --http mode into the methods where they are needed, rather than all at the top of site-inspector.rb. Most of the execution time in a site-inspector call with cached network data was simply in loading in a bunch of unused Ruby code. The tests pass with this change, but a couple tests that make calls to those third party libraries directly had lines added to the test files themselves to require the libraries too (even though maybe not strictly necessary).
  • VCR now hooks into typhoeus directly instead of webmock (now in master), which allows redirects to be followed more cleanly, and for VCR to be used for all tests. The test suite was updated to never hit the network. (The new tests for --http mode do hit the network, sadly.)

Ready for merge

Some of the above numbers reflect actual changes in the ~week or so between my old and new readings, and I'm not claiming it's bug-free -- I noted one domain that should be a redirect but isn't while typing all that.

But I think this is very much appropriate for merging, so that future PRs can be more incremental and can be working off of a tested base.

Shouldn't be a blocker for the tests, but would an integration test (e.g., Cucumber) work with known (live) domains?

Yeah, they totally could. Right now test_http_endpoints.rb and test_http_conclusions.rb hit the network, and form a sort of sanity check on the process. But there's a lot of domains and situations they check, so it'll get brittle before long. I think a next step post-PR is evaluating a non-curl based HTTP library, so that redirect behavior can be appropriately and granularly trapped and mocked.

Next steps

Some things I think should happen soon:

  • Turn the documentation above into a "data dictionary" like the kind @benbalter suggested above, potentially in YAML.
  • Replace Typhoeus with a non-libcurl-based HTTP client like em-http-request or patron, or some better one I haven't found. (Something with an emphasis on speed would be nice, for large batches.)
  • After replacing Typhoeus, update test suite to use mocked data and not hit the network.
  • Evaluate using HEAD instead of GET requests for HTTP endpoint analysis. @philipashlock has noted that some domains can respond with different responses and headers to HEAD as with GET, which is a total violation of the spec. I'd like to observe the differences at scale.
  • Evaluate whether the nested {http: {www: ..., root: ...}, https: {...}} structure makes sense, especially for subdomains, and in particular when providing www.domain.gov as the input domain.
  • Maybe a retry option to test domains multiple times which appear to be fully down.

Some other tiny things while I have them in my head/notes:

  • Old mode test that ensures it's only considered canonically HTTPS if the domain defaults to HTTPS.
  • More thorough tests for how --http mode handles subdomains, including explicit www subdomains.
  • Verify that :status == 0 can be swapped for :up in the :support_https check.
  • Verify that when redirects are to absolute or relative paths, any paths on the Location header preserve their case. (Domains are case-insensitive, but paths are not.)
  • Verify that ushmm.gov is classified as a redirect.
  • A domain should still be considered a redirector even if it only listens over plain HTTP, or plain HTTPS.

Todos you've flagged:

  • Separate HSTS parser.
  • Remove debug comments on outputting fetch details.
  • Use hash syntax on #request method instead of 5 positional args.
  • Multi-line hash for aesthetics in the #request call.
  • Remove inaccessible lighter --http mode.
  • Better integrate HTTP mode generally.
@konklone
Collaborator

konklone commented Apr 20, 2015

And for the thread's benefit and future reference, here's the current state of analyzing whitehouse.gov:

site-inspector whitehouse.gov --http
{
  "endpoints":{
    "https":{
      "www":{
        "https_valid":true,
        "https_bad_chain":false,
        "https_bad_name":false,
        "status":200,
        "up":true,
        "headers":{
          "content-length":"108116",
          "content-type":"text/html; charset=utf-8",
          "x-drupal-cache":"HIT",
          "p3p":"CP=\"NON DSP COR ADM DEV IVA OTPi OUR LEG\"",
          "x-age":"91",
          "x-cache-hits":"7",
          "x-varnish":"805277604 805277520",
          "x-ah-environment":"prod",
          "expires":"Sun, 19 Nov 1978 05:00:00 GMT",
          "link":"<https://www.whitehouse.gov/sites/whitehouse.gov/files/images/fb_share_postcard.jpg>; rel=\"image_src\",<https://www.whitehouse.gov/homepage>; rel=\"canonical\",<https://www.whitehouse.gov/node/5596>; rel=\"shortlink\"",
          "x-generator":"Drupal 7 (http://drupal.org)",
          "x-ua-compatible":"IE=edge,chrome=1",
          "content-language":"en",
          "strict-transport-security":"max-age=3600;include_subdomains",
          "etag":"\"1429368880.095-1\"",
          "date":"Sat, 18 Apr 2015 15:05:02 GMT",
          "connection":"keep-alive"
        },
        "hsts":true,
        "hsts_header":"max-age=3600;include_subdomains",
        "hsts_details":{
          "max_age":3600,
          "include_subdomains":false,
          "preload":false,
          "enabled":true,
          "preload_ready":false
        },
        "redirect":false,
        "redirect_immediately_to":null,
        "redirect_immediately_to_www":false,
        "redirect_immediately_to_https":false,
        "redirect_immediately_external":false,
        "redirect_to":null,
        "redirect_external":false
      },
      "root":{
        "https_valid":true,
        "https_bad_chain":false,
        "https_bad_name":false,
        "status":301,
        "up":true,
        "headers":{
          "server":"AkamaiGHost",
          "content-length":"0",
          "location":"https://www.whitehouse.gov/",
          "date":"Sat, 18 Apr 2015 15:05:02 GMT",
          "connection":"keep-alive",
          "strict-transport-security":"max-age=3600;includeSubDomains;preload",
          "content-type":"text/html; charset=utf-8"
        },
        "hsts":true,
        "hsts_header":"max-age=3600;includeSubDomains;preload",
        "hsts_details":{
          "max_age":3600,
          "include_subdomains":true,
          "preload":true,
          "enabled":true,
          "preload_ready":false
        },
        "redirect":true,
        "redirect_immediately_to":"https://www.whitehouse.gov/",
        "redirect_immediately_to_www":true,
        "redirect_immediately_to_https":true,
        "redirect_immediately_external":false,
        "redirect_to":"https://www.whitehouse.gov/",
        "redirect_external":false
      }
    },
    "http":{
      "www":{
        "status":302,
        "up":true,
        "headers":{
          "content-length":"0",
          "location":"https://www.whitehouse.gov/",
          "date":"Sat, 18 Apr 2015 15:05:02 GMT",
          "connection":"keep-alive",
          "server":"White House",
          "p3p":"CP=\"NON DSP COR ADM DEV IVA OTPi OUR LEG\""
        },
        "hsts":false,
        "hsts_header":null,
        "hsts_details":{
          "max_age":null,
          "include_subdomains":false,
          "preload":false,
          "enabled":false,
          "preload_ready":false
        },
        "redirect":true,
        "redirect_immediately_to":"https://www.whitehouse.gov/",
        "redirect_immediately_to_www":true,
        "redirect_immediately_to_https":true,
        "redirect_immediately_external":false,
        "redirect_to":"https://www.whitehouse.gov/",
        "redirect_external":false
      },
      "root":{
        "status":301,
        "up":true,
        "headers":{
          "server":"AkamaiGHost",
          "content-length":"0",
          "location":"https://www.whitehouse.gov/",
          "date":"Sat, 18 Apr 2015 15:05:03 GMT",
          "connection":"keep-alive",
          "strict-transport-security":"max-age=3600;includeSubDomains;preload",
          "content-type":"text/html; charset=utf-8"
        },
        "hsts":false,
        "hsts_header":"max-age=3600;includeSubDomains;preload",
        "hsts_details":{
          "max_age":3600,
          "include_subdomains":true,
          "preload":true,
          "enabled":true,
          "preload_ready":false
        },
        "redirect":true,
        "redirect_immediately_to":"https://www.whitehouse.gov/",
        "redirect_immediately_to_www":true,
        "redirect_immediately_to_https":true,
        "redirect_immediately_external":false,
        "redirect_to":"https://www.whitehouse.gov/",
        "redirect_external":false
      }
    }
  },
  "canonical_endpoint":"www",
  "canonical_protocol":"https",
  "canonical":"https://www.whitehouse.gov",
  "up":true,
  "broken_root":false,
  "broken_www":false,
  "support_https":true,
  "default_https":true,
  "downgrade_https":false,
  "enforce_https":true,
  "redirect":false,
  "redirect_to":null,
  "hsts":true,
  "hsts_header":"max-age=3600;include_subdomains",
  "hsts_entire_domain":true,
  "hsts_entire_domain_preload":false
}
if ENV['CACHE']
Typhoeus::Config.cache = SiteInspectorDiskCache.new(ENV['CACHE'])
Typhoeus::Config.cache = SiteInspectorDiskCache.new(ENV['CACHE'], ENV['CACHE_REPLACE'])

@benbalter
benbalter Apr 20, 2015
Owner

Should likely call Dotenv.load to read .env files here if we're using Env vars.
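A minimal sketch of that, assuming the dotenv gem becomes a dependency:

# Sketch: load .env before reading cache settings from the environment.
require "dotenv"
Dotenv.load

if ENV['CACHE']
  Typhoeus::Config.cache = SiteInspectorDiskCache.new(ENV['CACHE'], ENV['CACHE_REPLACE'])
end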

def initialize(domain)
# Utility parser for HSTS headers.
# RFC: http://tools.ietf.org/html/rfc6797
def self.hsts_parse(header)

@benbalter
benbalter Apr 20, 2015
Owner

This feels out of place here. Can we move it to its own class? Or at least its own file?

@konklone
konklone Apr 20, 2015
Collaborator

Yeah, it belongs in its own library. You want me to move it to an hsts-parser repo first?

@benbalter
benbalter Apr 20, 2015
Owner

Your call. Not sure if there's more to it / it's useful elsewhere / you want to maintain it. Even a ./lib/hsts_parser.rb would feel better.

to_get = uri(ssl, www)
# debugging
# puts "fetching: #{to_get}, #{followlocation ? "follow" : "no follow"}, #{ssl_verifypeer ? "verify peer, " : ""}#{ssl_verifyhost ? "verify host" : ""}"

@benbalter
benbalter Apr 20, 2015
Owner

Debug infos.

# debugging
# puts "fetching: #{to_get}, #{followlocation ? "follow" : "no follow"}, #{ssl_verifypeer ? "verify peer, " : ""}#{ssl_verifyhost ? "verify host" : ""}"
Typhoeus.get(to_get, followlocation: followlocation, ssl_verifypeer: ssl_verifypeer, ssl_verifyhost: (ssl_verifyhost ? 2 : 0), timeout: @timeout)

@benbalter
benbalter Apr 20, 2015
Owner

I'd put each hash value on its own line for readability / line length
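i.e., the same call reformatted with one value per line:

Typhoeus.get(
  to_get,
  followlocation: followlocation,
  ssl_verifypeer: ssl_verifypeer,
  ssl_verifyhost: (ssl_verifyhost ? 2 : 0),
  timeout: @timeout
)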

@@ -127,41 +190,447 @@ def redirect
end
end
def to_json
to_hash.to_json
def http

@benbalter
benbalter Apr 20, 2015
Owner

What does the http method do?

def to_json
to_hash.to_json
def http
details = {

@benbalter
benbalter Apr 20, 2015
Owner

Needs a comment here. No idea what this is doing.

# or like:
# https:// -> 200, http:// -> http://www
www = !!(

@benbalter
benbalter Apr 20, 2015
Owner

This should be its own method, not a variable in a method. How does this relate to the existing www? method? Perhaps www should return true, false, or :canonical (which would be truthy as well)?

!combos[:http][:root][:up] or
!combos[:http][:root][:status].to_s.start_with?("2")
)
) and (

@benbalter
benbalter Apr 20, 2015
Owner

This is super gross. Is there any way to make the logic here more explicit (even if more verbose)? Perhaps breaking out a bunch of named methods?

# It allows a site to be canonically HTTPS if the cert has
# a valid hostname but invalid chain issues.
https = !!(

@benbalter
benbalter Apr 20, 2015
Owner

Same here. I'd make this the https method.

combos[:http][:root][:redirect] or
!combos[:http][:root][:up] or
!combos[:http][:root][:status].to_s.start_with?("2")
) and (

@benbalter
benbalter Apr 20, 2015
Owner

Same here re nested conditionals

)
)
details[:canonical_endpoint] = www ? :www : :root

@benbalter
benbalter Apr 20, 2015
Owner

I'd make each one of the "details" a method. The http method should simply build a hash. It shouldn't perform any calculations itself.
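(Sketch of that shape; the keys mirror the current --http output, but the method names are assumed:)

# Sketch: http only assembles the hash; each value comes from its own,
# individually testable method.
def http
  {
    canonical_endpoint: canonical_endpoint,
    canonical_protocol: canonical_protocol,
    canonical: canonical,
    up: up?,
    support_https: support_https?,
    default_https: default_https?,
    enforce_https: enforce_https?
  }
end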

:secure_cookies => secure_cookies?,
:strict_transport_security => strict_transport_security?,
:headers => headers
https: {

@benbalter
benbalter Apr 20, 2015
Owner

This is a big change in behavior. Over in #22 we talked about having a hash and an expanded hash. This seems to essentially bury all the non-http checks (which would presumably be consistent between http and https).

}
end
# State of affairs at a particular endpoint.

@benbalter
benbalter Apr 20, 2015
Owner

Can you explain this a bit more? I'm lost.

@benbalter
Owner

benbalter commented Apr 20, 2015

Awesome work. Really excited to get this merged. There's a lot of great stuff in here, and it's making the gem a much higher-quality checker.

A giant 👍 on the behavior and choices you made in terms of the actual checks. A giant 👎 (in the nicest way possible) on the implementation. There are two things here that feel extremely brittle and hard to maintain:

  1. Two parallel behaviors depending on a (command-line use only) flag
  2. A giant super method that makes all sorts of calculations and builds a giant hash

Instead, in both http and http_endpoint, I'd break each of the individual checks out into their own methods. Ruby is very object-oriented and lots of small methods (rather than one giant method) makes testing functionality possible on an extremely granular level.

I'd actually like to get rid of http mode. Let's make it the default command-line behavior and support an expanded list of checks with some sort of --extended or similar flag. In Ruby land, we'll only calculate each check when the appropriate method is called, so it's not an issue.

A few specifics to your last comment:

The new definition of "up" is much looser

The test should be "if I type this domain into IE, will I see a page?". Is this still the case?

"strictly forcing" HTTPS, "defaulting" HTTPS

Yeah, we're going to need to document what each term means. I'm even lost reading this.

A redirector domain whose HTTPS endpoints redirect to external HTTP shouldn't be "punished" for being a redirector for an insecure domain (potentially not in their control).

That makes sense on paper, but that isn't how other checks work. For example, foo.gov redirects to bar.gov. foo.gov redirects via Nginx. bar.gov runs WordPress. A site inspector query for foo.gov will return WordPress as the CMS right now (the target domain). That's the in-browser experience, at least. For HTTPS tests, it sounds like we're only looking at the domain requested. I'd imagine we'd want to be consistent for sanity's sake.

tracks whether a domain "downgrades HTTPS"

How is this represented? I'm imagining an HTTPS flag that has a set list of options (e.g., default, enforce, downgrade, false, etc.)
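Purely to illustrate that idea (the PR as written uses separate boolean fields instead, and the helper predicates below are placeholders):

# Hypothetical single-value summary of a domain's HTTPS behavior.
def https_behavior
  return :downgrade if downgrades_https?
  return :enforce if enforces_https?
  return :default if defaults_https?
  false
end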

This value is then used as essentially synonymous with an HTTPS endpoint being "down", since an invalid hostname generally points to an unsupported method for accessing the domain.

👍

Cache can be enabled with CACHE=[directory]

I appreciate your use case, just be mindful that this is also a Ruby library which can be used in Rubyland (e.g., not via command line).

Replace Typhoeus with a non-libcurl-based HTTP client like em-http-request or patron, or some better one I haven't found. (Something with an emphasis on speed would be nice, for large batches.)

That's why I used Typhoeus, which can batch requests in parallel. What's the issue with cURL?

so that the --http flag now triggers a very different response object.

I'm supportive of a --http flag that changes the output, but 👎 to maintaining two parallel behaviors. Let's make decisions where we can. See above.

I think this is 99% good to merge, and with a bit of copy-pasta to break things out into separate methods, will be much stronger in the long run.


@konklone


konklone Apr 20, 2015

Collaborator

There are two things here that feel extremely brittle and hard to maintain:

  1. Two parallel behaviors depending on a (command-line use only) flag

To be clear, I did it this way as an interim thing, to avoid breaking backwards compatibility, and to avoid having to rewrite existing methods and guarantee harmony between old and new definitions.

  2. A giant super method that makes all sorts of calculations and builds a giant hash

I don't have the same negative reaction to this as you, though I think it would definitely benefit from having each major decision broken out into its own method.

That said, fundamentally the methodology here is defining boolean logic that creates rules about how all 4 endpoints interact to make holistic judgments about a domain, so big and/or logic trees are not avoidable, and I think they document the logic more clearly than breaking each leaf of the tree out into its own method would.

Instead, in both http and http_endpoint, I'd break each of the individual checks out into their own methods. Ruby is very object-oriented and lots of small methods (rather than one giant method) makes testing functionality possible on an extremely granular level.

👍

I'd actually like to get rid of http mode. Let's make it the default command-line behavior and support an expanded list of checks with some sort of --extended or similar flag. In Ruby land, we'll only calculate each check when the appropriate method is called, so it's not an issue.

👍 This was what I was hoping we'd do. I think that reorg would either be best done post-PR, or by branching from this branch and doing it as a PR to this PR, to at least have some isolated piece of work/documentation that's being compared to the (stable) situation on this branch at the moment.

The new definition of "up" is much looser

The test should be "if I type this domain into IE, will I see a page?". Is this still the case?

It's not -- something can be "up" if the domain responds to HTTP with a redirect to a domain that doesn't exist. So we could re-introduce a "live" field that matches the old definition, in addition to "up".

Whether a domain responds to HTTP at all is an important distinction in seeing whether a domain is web-enabled, which is what "up" is meant to measure.
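To make the distinction concrete, a sketch with hypothetical method names, assuming a status of 0 means no response came back at all:

# "up": the endpoint answered with anything, even a redirect to a dead domain.
def up?(endpoint)
  endpoint[:status] != 0
end

# "live" keeps the stricter, older sense: the endpoint itself serves a page.
# (A fuller version would also follow redirects and check where they land.)
def live?(endpoint)
  up?(endpoint) && endpoint[:status].to_s.start_with?("2")
end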

"strictly forcing" HTTPS, "defaulting" HTTPS

Yeah, we're going to need to document what each term means. I'm even lost reading this.

It's non-trivial, which is why it took me a while to do this, and why I felt the need to branch off into my own parallel track, so that I could more freely adapt definitions as I figured out what I needed to know. This spreadsheet has some loose definitions at the top, and might make it more obvious why I've separated the two definitions.

A redirector domain whose HTTPS endpoints redirect to external HTTP shouldn't be "punished" for being a redirector for an insecure domain (potentially not in their control).

That makes sense on paper, but that isn't how other checks work. For example, foo.gov redirects to bar.gov. foo.gov redirects via Nginx. bar.gov runs WordPress. A site inspector query for foo.gov will return WordPress as the CMS right now (the target domain). That's the in-browser experience, at least. For HTTPS tests, it sounds like we're only looking at the domain requested. I'd imagine we'd want to be consistent for sanity's sake.

I agree with the sentiment, but I don't know how to reconcile this. What makes sense for a CMS check (the eventually redirected-to domain) just doesn't make the same kind of sense when analyzing some factors about a domain, like whether it responds to HTTP, or responds to HTTPS, or whether HSTS is enabled (where the domain redirects to doesn't matter).

It's not impossible that we end up with these being separated into two different modes, or even that this branches into its own tool. I'm very flexible with how site-inspector would like to see these new definitions incorporated, tests and refactoring to be done, etc. But I'm not very flexible on the definitions, because they're the definitions I need. So if the definitions aren't reconcilable, I'm even happy to spin this into its own thing and then help integrate parts of it back into site-inspector as a dependency.

tracks whether a domain "downgrades HTTPS"

How is this represented? I'm imagining an HTTPS flag that has a set list of options (e.g., default, enforce, downgrade, false, etc.)

Right now, a :downgrade_https field on the hash, which indicates that a domain appears to redirect HTTPS requests down to HTTP. So a valid cert may be installed on the domain, but it's functionally unsupported.
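A sketch of what that field is describing (the :redirect_to key is a stand-in; the real hash may name it differently):

# The HTTPS endpoint answers, but immediately sends visitors back to plain
# HTTP, so a cert may be installed yet HTTPS is functionally unsupported.
def downgrades_https?(endpoint)
  endpoint[:up] && endpoint[:redirect] &&
    endpoint[:redirect_to].to_s.start_with?("http://")
end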

This value is then used as essentially synonymous with an HTTPS endpoint being "down", since an invalid hostname generally points to an unsupported method for accessing the domain.

👍

(I'll just re-emphasize here that certificate chain errors are treated as synonymous with the domain being up, since a chain error usually points to an incomplete or bungled attempt to formally support HTTPS access.)

Cache can be enabled with CACHE=[directory]

I appreciate your use case, just be mindful that this is also a Ruby library which can be used in Rubyland (e.g., not via command line).

Totally. Right now though, the cache gets set when the library is loaded. It may make sense for cache construction to be done in a method, so that Ruby can control it at all. (Makes me miss Node, where you can pass arguments to a require call, if the module you're loading is itself a function.)
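One way that could look from Ruby-land (the module shape and writer method are only a sketch, not something the gem currently exposes):

module SiteInspector
  class << self
    attr_reader :cache

    # Explicit opt-in for library users, instead of reading ENV["CACHE"]
    # once at require time.
    def cache=(directory)
      @cache = directory
    end
  end
end

# SiteInspector.cache = "/tmp/site-inspector-cache"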

Replace Typhoeus with a non-libcurl-based HTTP client like em-http-request or patron, or some better one I haven't found. (Something with an emphasis on speed would be nice, for large batches.)

That's why I used Typhoeus, which can batch requests in parallel. What's the issue with cURL?

Using a libcurl-based library means that all HTTP redirects are followed in C-land, not Ruby-land, and thus mocking libraries like Webmock or VCR cannot observe, record, or stub them. This caused me irreconcilable issues when testing the new behavior. I was able to find a workaround for following redirects generally in #23, but this workaround was insufficient when I added functionality that disabled following redirects and then mixed it with calls that do follow redirects.

This happens when the domain is probed for both kinds of behavior. What a domain redirects to immediately, and what it redirects to eventually, are both independently useful in deciding the attributes of a domain. But combining these two, using a libcurl-based Ruby library, isn't going to work with Ruby-based web mocking tools. :/
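For reference, the Ruby-land alternative described here looks roughly like this: turn off libcurl's redirect following and walk the Location headers manually, so each hop is a separate request that Webmock or VCR can observe (illustrative only, not the code in this branch):

require "typhoeus"
require "uri"

# Each hop is its own Typhoeus request, so mocking libraries see every step.
def fetch_following_redirects(url, limit = 5)
  response = Typhoeus.get(url, followlocation: false)
  location = response.headers && response.headers["Location"]
  return response if limit.zero? || location.nil?
  fetch_following_redirects(URI.join(url, location).to_s, limit - 1)
end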

so that the --http flag now triggers a very different response object.

I'm supportive of a --http flag that changes the output, but 👎 to maintaining two parallel behaviors. Let's make decisions where we can. See above.

I think this is 99% good to merge, and with a bit of copy-pasta to break things out into separate methods, will be much stronger in the long run.

Check out some of my responses above, and tell me what you think would be best here -- particularly about consistency of treating a domain on its own vs what it redirects to.


@benbalter


benbalter Apr 20, 2015

Owner

to avoid breaking backwards compatibility,

I wouldn't worry too much about that. I don't know any users besides us, and SemVer, so YOLO. 😄

I'm very flexible with how site-inspector would like to see these new definitions incorporated, tests and refactoring to be done, etc. But I'm not very flexible on the definitions, because they're the definitions I need.

That makes sense. (and I'm super happy with your definitions being codified / someone thinking through critically what is actually meant). Perhaps a follow redirects flag (defaulting to off) would reconcile things?

I think that reorg would either be best done post-PR,

Here's what I propose: Let's merge this basically as is. It's solid and it works. I'm a fan of wiring things up, then going back and making them elegant. If it's okay with you, we can split up the workload: if you can tackle documentation, I'd be glad to take a pass at refactoring things a bit in a new PR for your review.


@konklone


konklone Apr 21, 2015

Collaborator

Perhaps a follow redirects flag (defaulting to off) would reconcile things?

Something like that could definitely work.

If it's okay with you, we can split up the workload, if you can tackle documentation, I'd be glad to take a pass at refactoring things a bit in a new PR for your review.

Sounds good to me if it's good with you. I'll leave the merge 🔨 to you, and we can go from there.


benbalter added a commit that referenced this pull request Apr 21, 2015

Merge pull request #24 from benbalter/light
A much more detailed, granular HTTP mode

@benbalter benbalter merged commit 8f64458 into master Apr 21, 2015

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@benbalter benbalter deleted the light branch Apr 21, 2015

