OutputSanitizingAsyncFetch: runs right before PSOL responds #1695
Conversation
Headers starting with '@' will be removed from any output emitted by PSOL. This allows a concept of 'internal headers' which can be used to pass information around internally. In a follow-up PR this will be used to track (and remember) which redirects have been followed while fetching a resource, and with that we will be able to verify if using the resource as a basis for optimized output would violate effective Content-Security-Policies, if any. Note that a header is not allowed to contain '@' per http spec, and if one does 'escape' httpd will respond with a 5xx.
Using php and/or using "Headers always" in configuration to set up CSP headers in httpd is probably pretty common. These will both end up in err_headers_out, a httpd-only concept meaning that the headers should always be emitted, even on error statusses. We want to consider these CSP headers living in this collection when rewriting content, but leave them alone otherwise. Instead of introducing an err_headers_out (like) concept in RewriteDriver to model this, we will rely on the internal headers concept. ScanFilter will also look at kInternalContentSecurityPolicy. This header will be sanitized from our own output. Relies on #1695
Hmm, I guess Apache code can just filters those when converting back? |
@morlovich Yes, fixed that in another PR. However, I recently ran into a hard to reproduce CHECK failure, probably caused by one of these changes:
Perhaps the check failure is caused by |
net/instaweb/http/async_fetch.cc
Outdated
} | ||
|
||
void OutputSanitizingAsyncFetch::HandleDone(bool success) { | ||
SanitizeResponseHeaders(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one might be dubious threading-wise? Though I think our story on this hasn't been very clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welll... There is some documentation here:
https://github.com/pagespeed/mod_pagespeed/blob/master/net/instaweb/http/public/async_fetch.h#L114
.... which is not a very good place to have it.
At any rate thread issues seem very likely for this check if you look at how the code in RemoveAllWithPrefix looks like.
- Filter headers starting with '@' in AddResponseHeadersToRequestHelper
Using php and/or using "Headers always" in configuration to set up CSP headers in httpd is probably pretty common. These will both end up in err_headers_out, a httpd-only concept meaning that the headers should always be emitted, even on error statusses. We want to consider these CSP headers living in this collection when rewriting content, but leave them alone otherwise. Instead of introducing an err_headers_out (like) concept in RewriteDriver to model this, we will rely on the internal headers concept. ScanFilter will also look at kInternalContentSecurityPolicy. This header will be sanitized from our own output. Relies on #1695
updated, this PR now includes the code to also strip internal headers in Apache, and |
Had a flash of a "dumb question": instead of adding headers with magic names and then having to care about removing them, why not just have a separate spot, like a second ResponseHeaders? And I think I need to re-echo my earlier though that we may have more bugs with err_headers_out elsewhere.. |
@morlovich I think that is a great question.. Let me put down some thoughts: The initial reason reason for storing internal headers in the current ResponseHeaders set is that these get persisted automatically. That is a useful attribute for the redirect-following feature, because we want to remember how we got them so we will be able to evaluate CSP when we pull them from cache. Then came the story about CSP often ending up in err_response_headers.The concept of err_headers out seemed a bit httpd-specific. Having said that: I can see that it would be useful if we would be able to remember which header came from where with respect to err_headers_out (which would help with the possible other bugs with err_headers_out you mention, I think?). Modelling err_response_headers may be useful to accomplish that. Though thinking about it, maybe I would discuss an alternative: adding a 'source' attribute (and maybe some more metadata) to header/value pairs contained in the current ResponseHeaders structure to make it easier for server implementations to correctly map back any rewritten headers we give them, and perhaps state our intent (maybe "write back/"don't write back", or even set/append/delete/skip). If we can state intent, we don't need magic names. Considering the above, I'd lean towards thinking that this is stuff for more discussion maybe, and perhaps a separate PR. What do you think? |
@morlovich any thoughts on #1695 (comment) ? |
net/instaweb/http/async_fetch.cc
Outdated
|
||
bool OutputSanitizingAsyncFetch::SanitizeResponseHeaders() { | ||
if (response_headers() != nullptr && | ||
response_headers()->RemoveAllWithPrefix("@")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put the "@" in a static class constant in Headers?
there you can put doc that this is illegal from a protocol perspective (https://tools.ietf.org/html/rfc2616#page-31) 2.2 Basic Rules for 'token' excludes separators, including "@". That's why we can use it as an in-memory sentinal as long as it is never serialized.
@@ -316,6 +316,26 @@ class SharedAsyncFetch : public AsyncFetch { | |||
DISALLOW_COPY_AND_ASSIGN(SharedAsyncFetch); | |||
}; | |||
|
|||
// Can be used to sanitize headers and data before forwarding them on to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Can be used/Used/
class OutputSanitizingAsyncFetch : public SharedAsyncFetch { | ||
public: | ||
explicit OutputSanitizingAsyncFetch(AsyncFetch* base_fetch); | ||
virtual ~OutputSanitizingAsyncFetch(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/virtual/override/ (modulo moving it to before the ';')
pagespeed/apache/header_util.cc
Outdated
@@ -115,6 +115,9 @@ void AddResponseHeadersToRequestHelper(const ResponseHeaders& response_headers, | |||
for (int i = 0, n = response_headers.NumAttributes(); i < n; ++i) { | |||
const GoogleString& name = response_headers.Name(i); | |||
const GoogleString& value = response_headers.Value(i); | |||
if (strings::StartsWith(name, "@")) { | |||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious: does this get hit in tests? the concerning thing here is whether there's something linking through PSOL into the integrations that we have to catch on each integration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmarantz
I can see that this line should be covered by a test (and perhaps give known implementors with ports similar to mod_pagespeed a head's up, those that do not use ProxyFetch to wire up html rewriting).
Currently this is not hit in tests, but in the next planned follow-up to this PR, it will be easy to add and end-to-end test because that is where the rubber will start hitting the road:
https://github.com/apache/incubator-pagespeed-mod/pull/1696/files#diff-923e5a0c8d9a28d13112afd3d77b35a0R239
Headers starting with '@' will be removed from any output emitted
by PSOL. This allows a concept of 'internal headers' which can be
used to pass information around internally.
In a follow-up PR this will be used to track (and remember) which
redirects have been followed while fetching a resource, and with that
we will be able to verify if using the resource as a basis for optimized
output would violate effective Content-Security-Policies, if any.
Note that a header is not allowed to contain '@' per http spec, and
if one does 'escape' httpd will respond with a 5xx.