Skip to content
This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

OutputSanitizingAsyncFetch: runs right before PSOL responds #1695

Closed
wants to merge 3 commits into from

Conversation

oschaaf
Copy link
Member

@oschaaf oschaaf commented Dec 13, 2017

Headers starting with '@' will be removed from any output emitted
by PSOL. This allows a concept of 'internal headers' which can be
used to pass information around internally.

In a follow-up PR this will be used to track (and remember) which
redirects have been followed while fetching a resource, and with that
we will be able to verify if using the resource as a basis for optimized
output would violate effective Content-Security-Policies, if any.

Note that a header is not allowed to contain '@' per http spec, and
if one does 'escape' httpd will respond with a 5xx.

Headers starting with '@' will be removed from any output emitted
by PSOL. This allows a concept of 'internal headers' which can be
used to pass information around internally.

In a follow-up PR this will be used to track (and remember) which
redirects have been followed while fetching a resource, and with that
we will be able to verify if using the resource as a basis for optimized
output would violate effective Content-Security-Policies, if any.

Note that a header is not allowed to contain '@' per http spec, and
if one does 'escape' httpd will respond with a 5xx.
oschaaf added a commit that referenced this pull request Dec 14, 2017
Using php and/or using "Headers always" in configuration to set up CSP
headers in httpd is probably pretty common. These will both end up in
err_headers_out, a httpd-only concept meaning that the headers should
always be emitted, even on error statusses. We want to consider these
CSP headers living in this collection when rewriting content, but leave
them alone otherwise.
Instead of introducing an err_headers_out (like) concept in RewriteDriver
to model this, we will rely on the internal headers concept. ScanFilter
will also look at kInternalContentSecurityPolicy. This header will be
sanitized from our own output.

Relies on #1695
@morlovich
Copy link
Contributor

Hmm, I guess Apache code can just filters those when converting back?

@oschaaf
Copy link
Member Author

oschaaf commented Dec 18, 2017

@morlovich Yes, fixed that in another PR.
https://github.com/pagespeed/mod_pagespeed/pull/1696/files#diff-09994d3bc4ac0373770c34bfe92b93f2R118

However, I recently ran into a hard to reproduce CHECK failure, probably caused by one of these changes:

[1214/123730:FATAL:headers.cc(81)] Check failed: static_cast<size_t>(headers->size()) == needed.size() (7 vs. 6)
[ RUN      ] ProxyInterfaceTest.FlushHugeHtml

Perhaps the check failure is caused by OutputSanitizingAsyncFetch::HandleDone also stripping the headers, maybe it shouldn't do that when wrapping ProxyFetch, but I'm still working to understand how that could cause the above check failure.

}

void OutputSanitizingAsyncFetch::HandleDone(bool success) {
SanitizeResponseHeaders();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one might be dubious threading-wise? Though I think our story on this hasn't been very clean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welll... There is some documentation here:
https://github.com/pagespeed/mod_pagespeed/blob/master/net/instaweb/http/public/async_fetch.h#L114
.... which is not a very good place to have it.

At any rate thread issues seem very likely for this check if you look at how the code in RemoveAllWithPrefix looks like.

- Filter headers starting with '@' in AddResponseHeadersToRequestHelper
oschaaf added a commit that referenced this pull request Dec 19, 2017
Using php and/or using "Headers always" in configuration to set up CSP
headers in httpd is probably pretty common. These will both end up in
err_headers_out, a httpd-only concept meaning that the headers should
always be emitted, even on error statusses. We want to consider these
CSP headers living in this collection when rewriting content, but leave
them alone otherwise.
Instead of introducing an err_headers_out (like) concept in RewriteDriver
to model this, we will rely on the internal headers concept. ScanFilter
will also look at kInternalContentSecurityPolicy. This header will be
sanitized from our own output.

Relies on #1695
@oschaaf
Copy link
Member Author

oschaaf commented Dec 19, 2017

updated, this PR now includes the code to also strip internal headers in Apache, and OutputSanitizingAsyncFetch::HandleDone no longer touches the response headers.

@morlovich
Copy link
Contributor

Had a flash of a "dumb question": instead of adding headers with magic names and then having to care about removing them, why not just have a separate spot, like a second ResponseHeaders?

And I think I need to re-echo my earlier though that we may have more bugs with err_headers_out elsewhere..

@oschaaf
Copy link
Member Author

oschaaf commented Dec 21, 2017

@morlovich I think that is a great question..

Let me put down some thoughts: The initial reason reason for storing internal headers in the current ResponseHeaders set is that these get persisted automatically. That is a useful attribute for the redirect-following feature, because we want to remember how we got them so we will be able to evaluate CSP when we pull them from cache.

Then came the story about CSP often ending up in err_response_headers.The concept of err_headers out seemed a bit httpd-specific.
I was a bit afraid of ending up with response_headers, extra_response_headers, err_response_headers, and internal_headers. In the follow up to this PR I opted for storing the headers using the same magic naming scheme (@Content-Security-Policy) -- because for this we need to only read it and leave it alone after that.

Having said that: I can see that it would be useful if we would be able to remember which header came from where with respect to err_headers_out (which would help with the possible other bugs with err_headers_out you mention, I think?). Modelling err_response_headers may be useful to accomplish that. Though thinking about it, maybe I would discuss an alternative: adding a 'source' attribute (and maybe some more metadata) to header/value pairs contained in the current ResponseHeaders structure to make it easier for server implementations to correctly map back any rewritten headers we give them, and perhaps state our intent (maybe "write back/"don't write back", or even set/append/delete/skip). If we can state intent, we don't need magic names.

Considering the above, I'd lean towards thinking that this is stuff for more discussion maybe, and perhaps a separate PR. What do you think?

@oschaaf
Copy link
Member Author

oschaaf commented Jan 6, 2018

@morlovich any thoughts on #1695 (comment) ?


bool OutputSanitizingAsyncFetch::SanitizeResponseHeaders() {
if (response_headers() != nullptr &&
response_headers()->RemoveAllWithPrefix("@")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put the "@" in a static class constant in Headers?

there you can put doc that this is illegal from a protocol perspective (https://tools.ietf.org/html/rfc2616#page-31) 2.2 Basic Rules for 'token' excludes separators, including "@". That's why we can use it as an in-memory sentinal as long as it is never serialized.

@@ -316,6 +316,26 @@ class SharedAsyncFetch : public AsyncFetch {
DISALLOW_COPY_AND_ASSIGN(SharedAsyncFetch);
};

// Can be used to sanitize headers and data before forwarding them on to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Can be used/Used/

class OutputSanitizingAsyncFetch : public SharedAsyncFetch {
public:
explicit OutputSanitizingAsyncFetch(AsyncFetch* base_fetch);
virtual ~OutputSanitizingAsyncFetch();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/virtual/override/ (modulo moving it to before the ';')

@@ -115,6 +115,9 @@ void AddResponseHeadersToRequestHelper(const ResponseHeaders& response_headers,
for (int i = 0, n = response_headers.NumAttributes(); i < n; ++i) {
const GoogleString& name = response_headers.Name(i);
const GoogleString& value = response_headers.Value(i);
if (strings::StartsWith(name, "@")) {
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious: does this get hit in tests? the concerning thing here is whether there's something linking through PSOL into the integrations that we have to catch on each integration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarantz
I can see that this line should be covered by a test (and perhaps give known implementors with ports similar to mod_pagespeed a head's up, those that do not use ProxyFetch to wire up html rewriting).

Currently this is not hit in tests, but in the next planned follow-up to this PR, it will be easy to add and end-to-end test because that is where the rubber will start hitting the road:
https://github.com/apache/incubator-pagespeed-mod/pull/1696/files#diff-923e5a0c8d9a28d13112afd3d77b35a0R239

@oschaaf
Copy link
Member Author

oschaaf commented Jan 9, 2018

@jmarantz processed your comments in eed1ff1
(barring the one about test-coverage, which I proposed to land in the follow up to this)

@oschaaf oschaaf closed this Aug 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants