Add AMP_DOM_Document & meta tag sanitizer #3758

schlessera · 2019-11-15T15:11:32Z

Summary

This PR adds an abstraction for the DOM document and a new sanitizer AMP_Meta_Sanitizer that sanitizes meta tags in general, but more specifically for now the charset tag.

Fixes #3469
Fixes #855

Checklist

My pull request is addressing an open issue (please create one otherwise).
My code is tested and passes existing tests.
My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

schlessera · 2019-11-15T15:16:28Z

Note to self: don't push into a PR you've initiated but haven't created yet, it will force-create an empty one.

schlessera · 2019-11-15T15:43:28Z

We already have most of the meta tag logic in AMP_Theme_Support, but in a rather haphazard way.

Right now, the above code seems to fix the main bug we were chasing, but it introduces duplicate code.

I'll make some more changes now to extract the meta tag handling out of the AMP_Theme_Support class and put them into the new sanitizer instead. This will improve the quality of the sanitizers for stand-alone use and simplify the procedural mess that happens in AMP_Theme_Support.

schlessera · 2019-11-15T16:11:11Z

Note: the code in the AMP_Theme_Support class already states that it would be better off in a sanitizer instead:

amp-wp/includes/class-amp-theme-support.php

Line 1536 in 3b70f6d

* @todo All of this might be better placed inside of a sanitizer.

schlessera · 2019-11-19T10:53:26Z

What was supposed to be a small refactor here turned out to be more of a mess than expected.

I started with an AMP_Meta_Sanitizer because the gist plugin that solved the initial bug used one, but I'm not happy with the current shape of that sanitizer.

I don't want to spend much more time on refactoring here, as I've already collected higher-level thoughts about this in #3763. I think one principle we should adhere to in future refactorings is to split sanitization (changing from incorrect into correct) from optimization (reordering and similar). The reordering is what not only causes the entire AMP_Theme_Support::ensure_required_markup() to be strictly procedural and hard to untangle, but also causes many other headaches when refactoring into more encapsulated code.

schlessera · 2019-11-19T10:58:54Z

There are two open issues here:

What do we do if we encounter a charset other than utf-8?

Do we try to convert the entire content encoding? Do we work on a known subset only that we can test? Do we just throw an error and force the user to deal with this before trying to use AMP?

What are the requirements for the viewport?

As already stated in Slack, the existing tests currently expect maximum-scale=1.0 to be sanitized away, while I can't find a reference in the documentation that states this as a requirement.

So, given the input <meta name="viewport" content="maximum-scale=1.0">, should the output be:
a.) <meta name="viewport" content="width=device-width"> (what the test expects)
b.) <meta name="viewport" content="width=device-width,maximum-scale=1.0"> (what I'm currently producing based on how I interpreted the docs)

westonruter · 2019-11-19T17:14:44Z

What are the requirements for the viewport?

As already stated in Slack, the existing tests currently expect maximum-scale=1.0 to be sanitized away, while I can't find a reference in the documentation that states this as a requirement.

So, given the input <meta name="viewport" content="maximum-scale=1.0">, should the output be:
a.) <meta name="viewport" content="width=device-width"> (what the test expects)
b.) <meta name="viewport" content="width=device-width,maximum-scale=1.0"> (what I'm currently producing based on how I interpreted the docs)

Here are the allowed properties as well as the required width=device-width:

https://github.com/ampproject/amphtml/blob/e135c18c9460d4834578d6809ae5f77a1e3a231a/validator/validator-main.protoascii#L436-L445

You're right that <meta name="viewport" content="maximum-scale=1.0"> should ideally be sanitized as <meta name="viewport" content="width=device-width,maximum-scale=1.0">. What is happening now is a bit more clear-cut: it detects an invalid meta tag and removes it (while triggering a validation error), and later the proper one is put in its place. I suppose this could be done equally well by merging during conversion: if we encounter a meta viewport tag, we parse the properties, and if they don't include width=device-width we can still trigger the validation error. If that validation error is sanitized, then instead of removing the node we merge the properties.
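To illustrate the merging idea, here is a minimal sketch, not the PR's actual implementation. The function name `merge_viewport_sketch()` is hypothetical, and unlike the PR code it always moves `width` to the front rather than replacing it in place. It also does not check properties against the validator spec.

```php
<?php
/**
 * Hypothetical helper: force width=device-width while keeping other properties.
 *
 * @param string $content Value of the viewport meta tag's content attribute.
 * @return string Merged content value.
 */
function merge_viewport_sketch( $content ) {
	$settings = [];

	foreach ( explode( ',', $content ) as $pair ) {
		$parts = array_map( 'trim', explode( '=', $pair, 2 ) );

		// Skip malformed entries that are not "property=value" pairs.
		if ( 2 !== count( $parts ) || '' === $parts[0] ) {
			continue;
		}

		$settings[ strtolower( $parts[0] ) ] = $parts[1];
	}

	// Drop any existing width and put the required one first.
	unset( $settings['width'] );
	$settings = [ 'width' => 'device-width' ] + $settings;

	$pairs = [];
	foreach ( $settings as $property => $value ) {
		$pairs[] = "{$property}={$value}";
	}

	return implode( ',', $pairs );
}
```

With this sketch, an input of `maximum-scale=1.0` keeps its property and gains the required width, which corresponds to option b.) above.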

schlessera · 2019-11-19T17:16:19Z

The PR already does the merging, I just wanted to verify with you whether I got the requirements right because of the broken test.

westonruter · 2019-11-19T17:22:47Z

What do we do if we encounter a charset other than utf-8?

Do we try to convert the entire content encoding? Do we work on a known subset only that we can test? Do we just throw an error and force the user to deal with this before trying to use AMP?

Throwing a warning other than what we're currently doing?

amp-wp/includes/class-amp-theme-support.php

Lines 2361 to 2365 in f157602

    
// @todo If 'utf-8' is not the blog charset, then we'll need to do some character encoding conversation or "entityification".
if ( 'utf-8' !== strtolower( get_bloginfo( 'charset' ) ) ) {
	/* translators: %s: the charset of the current site. */
	trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
}

I think we should try to convert the entire encoding if possible. I was thinking that something like this should be possible (untested code):

diff --git a/includes/class-amp-theme-support.php b/includes/class-amp-theme-support.php
index 11b7d9a3..fb518e4c 100644
--- a/includes/class-amp-theme-support.php
+++ b/includes/class-amp-theme-support.php
@@ -2249,6 +2249,16 @@ class AMP_Theme_Support {
 			);
 		}
 
+		// AMP requires UTF-8, so convert encoding.
+		if ( strtolower( get_bloginfo( 'charset' ) ) !== 'utf-8' ) {
+			if ( function_exists( 'mb_convert_encoding' ) ) {
+				$response = mb_convert_encoding( $response, 'utf-8', get_bloginfo( 'charset' ) );
+			} else {
+				/* translators: %s: the charset of the current site. */
+				trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
+			}
+		}
+
 		$dom   = AMP_DOM_Utils::get_dom( $response );
 		$xpath = new DOMXPath( $dom );
 		$head  = $dom->getElementsByTagName( 'head' )->item( 0 );
@@ -2358,12 +2368,6 @@ class AMP_Theme_Support {
 			}
 		}
 
-		// @todo If 'utf-8' is not the blog charset, then we'll need to do some character encoding conversation or "entityification".
-		if ( 'utf-8' !== strtolower( get_bloginfo( 'charset' ) ) ) {
-			/* translators: %s: the charset of the current site. */
-			trigger_error( esc_html( sprintf( __( 'The database has the %s encoding when it needs to be utf-8 to work with AMP.', 'amp' ), get_bloginfo( 'charset' ) ) ), E_USER_WARNING ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions.error_log_trigger_error
-		}
-
 		AMP_Validation_Manager::finalize_validation(
 			$dom,
 			[

However, the last time I tried it I don't recall having success. It has been a while, however.

Even so, if UTF-8 is not the encoding, we should still at least issue a warning in Site Health.

westonruter · 2019-11-19T17:25:38Z

includes/sanitizers/class-amp-meta-sanitizer.php

+
+		$this->ensure_charset_is_present( $charset );
+
+		if ( ! $this->is_correct_charset() ) { // phpcs:ignore Generic.CodeAnalysis.EmptyStatement


I believe this will need to be done prior to the DOMDocument being constructed, but it didn't occur to me that perhaps DOMDocument allows for encoding to be changed after parsing. I don't think it facilitates this, however.

DOMDocument works with UTF-8 internally, so I think we need to have it converted before sending it to DOMDocument, or otherwise we might already have messed it up via loadHTML().

So is this is_correct_charset() check needed then? Re-encoding it into UTF-8 cannot be done at this point. It would need to be done earlier. This sanitizer just needs to make sure that the utf-8 meta charset is present.

Yes, I've put this into AMP_DOM_Document now.

westonruter · 2019-11-19T17:25:49Z

includes/sanitizers/class-amp-meta-sanitizer.php

+			static function ( $element ) {
+				return $element->parentNode->removeChild( $element );
+			},
+			iterator_to_array( $elements, false )


TIL! We could be using this lots of places where we're currently iterating over a DOMNodeList to push onto an array which we then iterate over to potentially remove elements. Removing elements from the DOM while iterating over a DOMNodeList is problematic since it is a live node list.

Yes, the DOMNodeList does not seem to have a built-in way to turn itself into an array, but any iterator can be turned into one via iterator_to_array().

Array/Traversable/Iterable handling in PHP is just a big mess, unfortunately.
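As a small standalone illustration of the point above (a sketch with made-up markup, not code from the PR): `getElementsByTagName()` returns a live `DOMNodeList`, so taking a snapshot with `iterator_to_array()` before removing nodes avoids mutating the list while iterating over it.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML( '<html><head><meta name="a"><meta name="b"><meta name="c"></head><body></body></html>' );

// Live list: it shrinks as soon as matching nodes are removed from the DOM.
$elements = $dom->getElementsByTagName( 'meta' );

// Static snapshot; `false` discards the iterator keys and reindexes from 0.
$snapshot = iterator_to_array( $elements, false );

foreach ( $snapshot as $element ) {
	$element->parentNode->removeChild( $element );
}

$remaining = $dom->getElementsByTagName( 'meta' )->length;
```

After the loop, the snapshot still holds the three removed nodes while the document no longer contains any `meta` elements.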

westonruter · 2019-11-19T17:33:32Z

includes/sanitizers/class-amp-meta-sanitizer.php

+	 *
+	 * @return DOMElement The document's <head> element.
+	 */
+	protected function ensure_head_is_present() {


I wonder if this is, in fact, needed, because now we prevent processing responses when no <head> is present:

amp-wp/includes/class-amp-theme-support.php

Lines 2052 to 2058 in f157602

/*
 * Abort if the response was not HTML. To be post-processed as an AMP page, the output-buffered document must
 * have the HTML mime type and it must start with <html> followed by <head> tag (with whitespace, doctype, and comments optionally interspersed).
 */
if ( 'text/html' !== substr( AMP_HTTP::get_response_content_type(), 0, 9 ) || ! preg_match( '#^(?:<!.*?>|\s+)*<html.*?>(?:<!.*?>|\s+)*<head\b(.*?)>#is', $response ) ) {
	return $response;
}

I think we should treat the sanitizers independently from the AMP_Theme_Support processing here, as we want to produce a stand-alone sanitizer library in the long term.

While it is perfectly fine to do early bails in AMP_Theme_Support for various reasons as part of the "WP integration" of the plugin, I think that "asserting" the requirements in the sanitizer should still be done nevertheless.

OK, good call. 👍

westonruter · 2019-11-19T17:38:16Z

includes/sanitizers/class-amp-meta-sanitizer.php

+			return;
+		}
+
+		$this->meta_tags[ self::TAG_CHARSET ][] = $this->create_charset_element( $charset ?: static::AMP_CHARSET );


I think this can just always use static::AMP_CHARSET because this is the only thing that AMP allows (utf-8).

Yes, but only if we do indeed transform the document first. Otherwise, we're not only in the wrong charset, but we've also lost the information about what the actual charset is => broken^2. :)

westonruter · 2019-11-19T17:48:30Z

includes/sanitizers/class-amp-meta-sanitizer.php

+	protected function ensure_viewport_is_present() {
+		if ( empty( $this->meta_tags[ self::TAG_VIEWPORT ] ) ) {
+			$this->meta_tags[ self::TAG_VIEWPORT ][] = $this->create_viewport_element( static::AMP_VIEWPORT );
+			return;
+		}
+
+		// Ensure we have the 'width=device-width' setting included.
+		$viewport_tag      = $this->meta_tags[ self::TAG_VIEWPORT ][0];
+		$viewport_content  = $viewport_tag->getAttribute( 'content' );
+		$viewport_settings = array_map( 'trim', explode( ',', $viewport_content ) );
+		$width_found       = false;
+
+		foreach ( $viewport_settings as $index => $viewport_setting ) {
+			list( $property, $value ) = array_map( 'trim', explode( '=', $viewport_setting ) );
+			if ( 'width' === $property ) {
+				if ( 'device-width' !== $value ) {
+					$viewport_settings[ $index ] = 'width=device-width';
+				}
+				$width_found = true;
+				break;
+			}
+		}
+
+		if ( ! $width_found ) {
+			array_unshift( $viewport_settings, 'width=device-width' );
+		}
+
+		$viewport_tag->setAttribute( 'content', implode( ',', $viewport_settings ) );
+	}


This will need to obtain the tag spec for this meta tag essentially incorporate this logic:

amp-wp/includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php

Lines 1593 to 1652 in f157602

/**
 * Check if attribute has valid properties.
 *
 * @since 0.7
 *
 * @param DOMElement $node Node.
 * @param string $attr_name Attribute name.
 * @param array[]|string[] $attr_spec_rule Attribute spec rule.
 *
 * @return string:
 *      - AMP_Rule_Spec::PASS - $attr_name has a value that matches the rule.
 *      - AMP_Rule_Spec::FAIL - $attr_name has a value that does *not* match rule.
 *      - AMP_Rule_Spec::NOT_APPLICABLE - $attr_name does not exist or there
 *                                        is no rule for this attribute.
 */
private function check_attr_spec_rule_value_properties( DOMElement $node, $attr_name, $attr_spec_rule ) {
	if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) && $node->hasAttribute( $attr_name ) ) {
		$properties = [];
		foreach ( explode( ',', $node->getAttribute( $attr_name ) ) as $pair ) {
			$pair_parts = explode( '=', $pair, 2 );
			if ( 2 !== count( $pair_parts ) ) {
				return 0;
			}
			$properties[ strtolower( trim( $pair_parts[0] ) ) ] = trim( $pair_parts[1] );
		}

		// Fail if there are unrecognized properties.
		if ( count( array_diff( array_keys( $properties ), array_keys( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) ) ) > 0 ) {
			return AMP_Rule_Spec::FAIL;
		}

		foreach ( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] as $prop_name => $property_spec ) {

			// Mandatory property is missing.
			if ( ! empty( $property_spec['mandatory'] ) && ! isset( $properties[ $prop_name ] ) ) {
				return AMP_Rule_Spec::FAIL;
			}

			if ( ! isset( $properties[ $prop_name ] ) ) {
				continue;
			}

			$prop_value = $properties[ $prop_name ];

			// Required value is absent, so fail.
			$required_value = null;
			if ( isset( $property_spec['value'] ) ) {
				$required_value = $property_spec['value'];
			} elseif ( isset( $property_spec['value_double'] ) ) {
				$required_value = $property_spec['value_double'];
				$prop_value     = (float) $prop_value;
			}
			if ( isset( $required_value ) && $prop_value !== $required_value ) {
				return AMP_Rule_Spec::FAIL;
			}
		}
		return AMP_Rule_Spec::PASS;
	}
	return AMP_Rule_Spec::NOT_APPLICABLE;
}

Otherwise, if someone creates a meta tag like:

<meta name=viewport content="width=content-width,BOGUS=SUPER_BOGUS">

Then the tag-and-attribute sanitizer will end up removing this meta tag that you created here.

To make it easier to obtain the tag spec for meta name=viewport, instead of iterating over all meta tag specs and finding the one that has that spec_name, it would perhaps be useful to refactor the spec generation logic to use the spec_name as the array key, like so:

--- a/includes/sanitizers/class-amp-allowed-tags-generated.php
+++ b/includes/sanitizers/class-amp-allowed-tags-generated.php
@@ -10106,7 +10106,7 @@ class AMP_Allowed_Tags_Generated {
 				'unique' => true,
 			),
 		),
-		array(
+		'meta name=viewport' => array(
 			'attr_spec_list' => array(
 				'content' => array(
 					'mandatory' => true,
@@ -10135,7 +10135,6 @@ class AMP_Allowed_Tags_Generated {
 			'tag_spec' => array(
 				'mandatory' => true,
 				'mandatory_parent' => 'head',
-				'spec_name' => 'meta name=viewport',
 				'spec_url' => 'https://amp.dev/documentation/guides-and-tutorials/learn/spec/amphtml#required-markup',
 				'unique' => true,
 			),

Then we could extend the \AMP_Allowed_Tags_Generated::get_allowed_tag() with a new $spec_name arg:

--- a/includes/sanitizers/class-amp-allowed-tags-generated.php
+++ b/includes/sanitizers/class-amp-allowed-tags-generated.php
@@ -17625,10 +17625,13 @@ class AMP_Allowed_Tags_Generated {
 	 *
 	 * @since 0.7
 	 * @param string $node_name Tag name.
+	 * @param string $spec_name Spec name, to reduce the tag specs to a single one.
 	 * @return array|null Allowed tag, or null if the tag does not exist.
 	 */
-	public static function get_allowed_tag( $node_name ) {
-		if ( isset( self::$allowed_tags[ $node_name ] ) ) {
+	public static function get_allowed_tag( $node_name, $spec_name = null ) {
+		if ( isset( $spec_name, self::$allowed_tags[ $node_name ][ $spec_name ] ) ) {
+			return self::$allowed_tags[ $node_name ][ $spec_name ];
+		} elseif ( isset( self::$allowed_tags[ $node_name ] ) ) {
 			return self::$allowed_tags[ $node_name ];
 		}
 		return null;

And then here in the ensure_viewport_is_present it could easily locate the tag spec via:

$tag_spec = \AMP_Allowed_Tags_Generated::get_allowed_tag( 'meta', 'meta name=viewport' );

This would be useful elsewhere that we refer to tag specs, including in the style sanitizer.
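A self-contained sketch of the proposed lookup may make the idea easier to follow. The class `Allowed_Tags_Sketch` and its tiny spec array are hypothetical stand-ins for the generated `AMP_Allowed_Tags_Generated` data; only the lookup logic mirrors the diff above.

```php
<?php
final class Allowed_Tags_Sketch {

	/**
	 * Hypothetical excerpt of tag specs, keyed by tag name and then by spec name.
	 *
	 * @var array
	 */
	private static $allowed_tags = [
		'meta' => [
			'meta charset=utf-8' => [
				'tag_spec' => [ 'mandatory' => true ],
			],
			'meta name=viewport' => [
				'tag_spec' => [
					'mandatory'        => true,
					'mandatory_parent' => 'head',
				],
			],
		],
	];

	/**
	 * Get one tag spec by spec name, or all specs for a tag name.
	 *
	 * @param string      $node_name Tag name.
	 * @param string|null $spec_name Optional spec name to narrow the result.
	 * @return array|null Tag spec(s), or null if the tag does not exist.
	 */
	public static function get_allowed_tag( $node_name, $spec_name = null ) {
		if ( isset( $spec_name, self::$allowed_tags[ $node_name ][ $spec_name ] ) ) {
			return self::$allowed_tags[ $node_name ][ $spec_name ];
		}
		if ( isset( self::$allowed_tags[ $node_name ] ) ) {
			return self::$allowed_tags[ $node_name ];
		}
		return null;
	}
}

$viewport_spec = Allowed_Tags_Sketch::get_allowed_tag( 'meta', 'meta name=viewport' );
$all_meta      = Allowed_Tags_Sketch::get_allowed_tag( 'meta' );
```

Note that, as in the diff, an unknown spec name falls back to returning every spec for the tag, so callers would need to handle both shapes.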

More generally, would it be possible to refactor the tag & meta sanitizer to strip the offending properties only, and then check if the entire thing is still valid or not? This way, we wouldn't have to duplicate this logic all over the place...

it would perhaps be useful to refactor the spec generation logic to use the spec_name as the array key

Is this guaranteed not to have collisions?

More generally, would it be possible to refactor the tag & meta sanitizer to strip the offending properties only, and then check if the entire thing is still valid or not? This way, we wouldn't have to duplicate this logic all over the place...

Yes, this could be done. There could be validation errors with type invalid_meta_property. I like that.

Is this guaranteed not to have collisions?

It turns out, yes! Sometimes the spec_name is omitted when it is just a regular HTML element, I think. And for script we just need to derive a spec_name from the extension_spec. When this is done, all tag specs are unique: https://gist.github.com/westonruter/8e6a8ca427c69b23cdd8e26dcccbfc3d

As discussed during the plugin sync, I'll tackle the enhanced validation-stripping-granularity in a separate PR to keep this one moderately sane.

Related: #3780. May be best to wait to work on this until that PR is merged.

Actually, I can include the property sanitization as part of #3780.

Or rather a subsequent PR built off of that PR.

Opened a new issue to track this: #4070

westonruter · 2019-11-22T18:09:38Z

includes/utils/class-amp-dom-document.php

+		// Force-add http-equiv charset to make DOMDocument behave as it should.
+		// See: http://php.net/manual/en/domdocument.loadhtml.php#78243.
+		$source = str_replace(
+			'<head>',


What if $source has a <head> that has attributes, like <head profile="http://www.acme.com/profiles/core">?

I modified this to use a regex to match any head tag. However, tests revealed now that DOMDocument seems to strip all attributes from the head tag...

westonruter · 2019-11-22T18:12:22Z

includes/utils/class-amp-dom-document.php

+			if ( false === strpos( $substring, '<head>' ) ) {
+				// Create the required HTML structure if none exists yet.
+				$content = "<html><head></head><body>{$content}</body></html>";
+			} else {
+				// <head> seems to be present without <body>.
+				$content = preg_replace( '#</head>(.*)</html>#', '</head><body>$1</body>', $content );
+			}
+		} elseif ( false === strpos( $substring, '<head>' ) ) {
+			// Create a <head> element if none exists yet.
+			$content = str_replace( '<body', '<head></head><body', $content );


What if <head> is present but it contains attributes, like:

<head profile='http://www.acme.com/profiles/core'>

Yes, will change to a preg_replace().
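For illustration, a minimal sketch of a check that tolerates attributes on the tag. The helper name is hypothetical and the pattern is only an assumption about what such a check could look like. Like the constants merged later in this PR, it does not account for HTML comments.

```php
<?php
/**
 * Hypothetical helper: detect a <head> start tag with or without attributes.
 *
 * @param string $content Markup to inspect.
 * @return bool Whether a <head> start tag was found.
 */
function has_head_start_tag_sketch( $content ) {
	// `<head` must be followed by `>` or by whitespace plus attributes,
	// so `<header>` does not match.
	return 1 === preg_match( '#<head(?:\s[^>]*)?>#i', $content );
}
```

A plain `strpos( $content, '<head>' )` would miss the `profile` example above, whereas this pattern matches it.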

westonruter · 2019-11-22T18:15:37Z

includes/utils/class-amp-dom-document.php

+		);
+		$head->insertBefore( $charset, $head->firstChild );
+
+		return str_replace( '<meta http-equiv="content-type" content="text/html; charset=' . self::AMP_ENCODING . '">', '', parent::saveHTML( $node ) );


This replacement key makes me a bit nervous, as it means we have to rely on libxml always serializing attributes the same way. As long as we have tests for it, I suppose there is nothing to worry about.

It's a tag with two string-based attributes. I'm not sure how far off this could get.

However, I'll replace it with a preg_replace() to make it case-insensitive and "quote-insensitive".
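A rough sketch of what a case-insensitive and quote-insensitive replacement could look like. The function name and the exact pattern are assumptions for illustration, not the code that ended up in the PR, and the pattern still relies on the attribute order staying the same.

```php
<?php
/**
 * Hypothetical helper: strip the temporary http-equiv charset meta tag.
 *
 * @param string $html     Serialized HTML.
 * @param string $encoding Encoding named in the tag.
 * @return string HTML without the temporary tag.
 */
function strip_http_equiv_charset_sketch( $html, $encoding = 'utf-8' ) {
	// The backreferences \1 and \2 require each attribute's quotes to match;
	// the trailing `i` flag makes the whole pattern case-insensitive.
	$pattern = '#<meta\s+http-equiv=(["\']?)content-type\1\s+content=(["\']?)text/html;\s*charset='
		. preg_quote( $encoding, '#' )
		. '\2\s*/?>#i';

	return preg_replace( $pattern, '', $html );
}
```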

westonruter · 2019-11-22T18:16:59Z

includes/utils/class-amp-dom-document.php

+		if ( $success ) {
+			$this->encoding = self::AMP_ENCODING;
+			$head           = $this->getElementsByTagName( 'head' )->item( 0 );
+			$head->removeChild( $head->firstChild );


What if the above str_replace() didn't actually do any replacement? It could remove the wrong node here.

I'll add a safeguard around it.

westonruter · 2019-11-22T18:17:59Z

includes/utils/class-amp-dom-document.php

+		$success = parent::loadHTML( $source, $options );
+
+		if ( $success ) {
+			$this->encoding = self::AMP_ENCODING;


Wouldn't DOMDocument populate this? Why is this needed?

It's one of the many bits I found online that might or might not improve DOMDocument behavior with UTF-8.

However, I don't think this does much, so I'll remove it.

westonruter · 2019-11-22T18:20:57Z

includes/utils/class-amp-dom-document.php

+			$this->original_encoding = mb_detect_encoding( $source );
+		}
+
+		// Guessing the encoding seems to have failed, so we assume UTF-8 instead.
+		if ( empty( $this->original_encoding ) ) {
+			$this->original_encoding = self::AMP_ENCODING;
+		}
+
+		$this->original_encoding = $this->sanitize_encoding( $this->original_encoding );
+
+		$target = mb_convert_encoding( $source, self::AMP_ENCODING, $this->original_encoding );


This will need to short-circuit if the mbstring extension is not loaded, since we do not currently include it among the $_amp_required_extensions.

Actually, I suggest that if a document is not utf-8 and we cannot convert into utf-8 that we throw an exception. AMP would just not be available.
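A minimal sketch of that suggestion, under stated assumptions: the function name is hypothetical, the `iconv` fallback is an addition not discussed in this thread, and the exception type is chosen arbitrarily.

```php
<?php
/**
 * Hypothetical helper: convert markup to UTF-8 or fail loudly.
 *
 * @param string $source            Markup in its original encoding.
 * @param string $original_encoding Name of the original encoding.
 * @return string Markup encoded as UTF-8.
 * @throws RuntimeException If no conversion mechanism is available.
 */
function convert_to_utf8_sketch( $source, $original_encoding ) {
	if ( 'utf-8' === strtolower( $original_encoding ) ) {
		return $source;
	}

	// mbstring is optional, so guard against it being unavailable.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		return mb_convert_encoding( $source, 'UTF-8', $original_encoding );
	}

	// Assumption: fall back to iconv when it is available instead.
	if ( function_exists( 'iconv' ) ) {
		$converted = iconv( $original_encoding, 'UTF-8', $source );
		if ( false !== $converted ) {
			return $converted;
		}
	}

	throw new RuntimeException( "Cannot convert from {$original_encoding} to UTF-8." );
}

// "café" in ISO-8859-1 uses the single byte 0xE9 for "é".
$converted = convert_to_utf8_sketch( "caf\xE9", 'ISO-8859-1' );
```

A caller could catch the exception and treat AMP as unavailable for that response.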

westonruter · 2019-11-22T18:22:49Z

includes/utils/class-amp-dom-document.php

+			$http_equiv_tag && str_replace( $http_equiv_tag, '', $content );
+			$charset_tag && str_replace( $charset_tag, '', $content );


These seem to be unused expressions. Shouldn't they be assigned to something?

These are string replacements that only happen when the strings evaluate to true. Too cryptic?

The string replacements aren't done in place though, as $content is not passed by reference. I would expect something like this:

if ( $charset_tag ) { $content = str_replace( $charset_tag, '', $content ); }

You're right, this is nonsense. I'll change it.
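The difference is easy to reproduce in isolation. This is a standalone sketch with made-up input, showing that `str_replace()` returns a new string and leaves its subject argument untouched:

```php
<?php
$content     = '<head><meta charset="latin1"><title>Example</title></head>';
$charset_tag = '<meta charset="latin1">';

// The replacement result is computed and then thrown away.
$charset_tag && str_replace( $charset_tag, '', $content );
$after_discarded_call = $content;

// Assigning the return value is what actually changes $content.
if ( $charset_tag ) {
	$content = str_replace( $charset_tag, '', $content );
}
```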

westonruter · 2019-11-22T18:23:57Z

includes/utils/class-amp-dom-document.php

+			case 'xpath':
+				$this->xpath = new DOMXPath( $this );
+				return $this->xpath;


westonruter · 2019-11-22T18:24:20Z

includes/utils/class-amp-dom-document.php

+ *
+ * @since 1.5
+ *
+ * @property DOMXpath $xpath XPath query object for this document.


Suggested change

* @property DOMXpath $xpath XPath query object for this document.

* @property DOMXPath $xpath XPath query object for this document.

schlessera · 2019-11-23T09:23:30Z

Yes, some more hardening is needed here, this was the first attempt that finally passed all the tests.

Also, I need to write a lot of additional tests for the new code, move some tests over, and think about how best to test the actual encoding conversion.

schlessera · 2019-11-27T17:56:12Z

There are still a few minor kinks to figure out, but I think I've got general automated character set conversion working.

Also, the new AMP_DOM_Document provides a way of centralizing some operations that are reused again and again, saving some processing time. I've done a preliminary change for fetching $xpath, $head and $body for some classes now.

westonruter · 2019-11-28T06:32:02Z

includes/class-amp-autoloader.php

@@ -96,6 +97,7 @@ class AMP_Autoloader {
 		'AMP_Content'                        => 'includes/templates/class-amp-content',
 		'AMP_Content_Sanitizer'              => 'includes/templates/class-amp-content-sanitizer',
 		'AMP_Post_Template'                  => 'includes/templates/class-amp-post-template',
+		'AMP_DOM_Document'                   => 'includes/utils/class-amp-dom-document',


Should this go into the Amp\AmpWP namespace now that we have it? Or is this premature?

No, I think it makes sense, if we can agree that we'll use this new abstraction everywhere going forward.

I assume it should be Amp\AmpWP\DOMDocument, due to its importance?

Sure, or what about Amp\AmpWP\DOM\Document?

If we use a subnamespace like that, we would also add DOMElementList in there as well.

It would allow us to import the namespace only and then use the relative DOM subnamespace in the code, if we want:

use Amp\AmpWP\DOM;

$dom  = new DOM\Document();
$list = ( new DOM\ElementList() )
	->add( $image1 )
	->add( $image2 );

However, we might want to go with Amp\AmpWP\Dom\Document to match Amp. It would make the result less "shouty".

westonruter · 2019-11-28T06:35:48Z

includes/sanitizers/class-amp-meta-sanitizer.php

+	/**
+	 * Placeholder for default arguments, to be set in child classes.
+	 *
+	 * @var array
+	 */
+	protected $DEFAULT_ARGS = [ // phpcs:ignore WordPress.NamingConventions.ValidVariableName.PropertyNotSnakeCase
+		'use_document_element' => true, // We want to work on the header, so we need the entire document.
+	];


Not sure if this will have any practical effect. Since we're not referencing $this->args['use_document_element'] in the sanitizer, this is just setting a member which is going to be either unused or overridden in the first place.

Yes, I think this can be removed now.

westonruter · 2019-11-28T06:38:57Z

includes/sanitizers/class-amp-meta-sanitizer.php

+	protected function ensure_viewport_is_present() {
+		if ( empty( $this->meta_tags[ self::TAG_VIEWPORT ] ) ) {
+			$this->meta_tags[ self::TAG_VIEWPORT ][] = $this->create_viewport_element( static::AMP_VIEWPORT );
+			return;
+		}
+
+		// Ensure we have the 'width=device-width' setting included.
+		$viewport_tag      = $this->meta_tags[ self::TAG_VIEWPORT ][0];
+		$viewport_content  = $viewport_tag->getAttribute( 'content' );
+		$viewport_settings = array_map( 'trim', explode( ',', $viewport_content ) );
+		$width_found       = false;
+
+		foreach ( $viewport_settings as $index => $viewport_setting ) {
+			list( $property, $value ) = array_map( 'trim', explode( '=', $viewport_setting ) );
+			if ( 'width' === $property ) {
+				if ( 'device-width' !== $value ) {
+					$viewport_settings[ $index ] = 'width=device-width';
+				}
+				$width_found = true;
+				break;
+			}
+		}
+
+		if ( ! $width_found ) {
+			array_unshift( $viewport_settings, 'width=device-width' );
+		}
+
+		$viewport_tag->setAttribute( 'content', implode( ',', $viewport_settings ) );
+	}


Or rather a subsequent PR built off of that PR.

westonruter · 2019-11-28T06:48:29Z

includes/utils/class-amp-dom-document.php

+				$this->head = $this->getElementsByTagName( 'head' )->item( 0 );
+				return $this->head;
+			case 'body':
+				$this->body = $this->getElementsByTagName( 'body' )->item( 0 );


If $this->head or $this->body are null, should this create those elements just in time?

It does. We're not defining the internal properties $head and $body, so the first time around, they don't exist and a call to $this->head will trigger the magic __get(). Within the magic getter, we dynamically set the property $head, so the next call will immediately retrieve it and not hit the magic getter anymore.
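The caching behavior described here can be shown with a reduced sketch. The class name and the `$lookups` counter are hypothetical and exist only to make the effect observable; the `#[AllowDynamicProperties]` attribute is an addition to avoid the dynamic-property deprecation notice on newer PHP versions.

```php
<?php
#[AllowDynamicProperties]
class Lazy_Document_Sketch extends DOMDocument {

	/**
	 * Number of times the magic getter had to look up <head>.
	 *
	 * @var int
	 */
	public $lookups = 0;

	public function __get( $name ) {
		if ( 'head' === $name ) {
			$this->lookups++;

			// Creating the property here means later reads bypass __get().
			$this->head = $this->getElementsByTagName( 'head' )->item( 0 );
			return $this->head;
		}

		trigger_error( 'Undefined property: ' . __CLASS__ . '::$' . $name, E_USER_NOTICE );
		return null;
	}
}

$dom = new Lazy_Document_Sketch();
$dom->loadHTML( '<html><head><title>Example</title></head><body></body></html>' );

$first  = $dom->head; // Triggers __get() and caches the element.
$second = $dom->head; // Reads the now-existing property directly.
```

Note that this only caches an existing element; if the document had no `<head>`, the cached value would be `null` rather than a newly created element.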

westonruter · 2019-11-28T06:49:22Z

includes/utils/class-amp-dom-document.php

+		}
+
+		// Mimic regular PHP behavior for missing notices.
+		trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput


Suggested change

trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput

trigger_error( "Undefined property: " . __CLASS__ . "::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput

westonruter · 2019-11-28T06:50:32Z

includes/utils/class-amp-dom-utils.php

-		// Make sure there is a head and a body.
-		$head = $dom->getElementsByTagName( 'head' )->item( 0 );
-		if ( ! $head ) {
-			$head = $dom->createElement( 'head' );
-			$dom->documentElement->insertBefore( $head, $dom->documentElement->firstChild );
-		}
-		$body = $dom->getElementsByTagName( 'body' )->item( 0 );
-		if ( ! $body ) {
-			$body = $dom->createElement( 'body' );
-			$dom->documentElement->appendChild( $body );
-		}


Is the ensuring of the body and head being done now?

The AMP_DOM_Document now enforces a base structure:

amp-wp/includes/utils/class-amp-dom-document.php

Lines 170 to 188 in 7513e41

/**
 * Normalize the document structure.
 *
 * This makes sure the document adheres to the general structure that AMP requires:
 * ```
 * <!doctype html>
 * <html>
 *   <head>
 *     <meta charset="utf-8">
 *   </head>
 *   <body>
 *   </body>
 * </html>
 * ```
 *
 * @param string $content Content to normalize the structure of.
 * @return string Normalized content.
 */
private function normalize_document_structure( $content ) {

westonruter · 2019-11-29T16:07:48Z

Sounds good to me.

schlessera · 2019-12-05T09:06:04Z

⚠️ Rebasing now to get access to the new namespace code.

schlessera · 2019-12-09T17:02:22Z

⚠️ Rebasing now to resolve conflicts.

pierlon · 2019-12-11T07:07:20Z

Hey @schlessera, reverting 4b39169 does resolve the failing Twitter embed tests, but I'm not sure as to why they are failing in the first place as yet.

westonruter · 2019-12-18T23:21:43Z

Alpha build for testing: amp.zip (v1.5.0-alpha-20191218T231919Z-8cbe6d60)

westonruter · 2019-12-19T05:07:11Z

Cloning the Dom\Document can lead to such edge cases, so this is also something to use with caution until we were able to play around more with it.

Wouldn't this be addressed by implementing __clone()? For example:

function __clone() {
    unset( $this->xpath, $this->head, $this->body );
}

This will reset those properties so that the next time the getter is invoked, they will get re-populated.
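A reduced sketch of the effect, assuming a lazily cached `head` property as discussed earlier in this review. The class name is hypothetical and only `head` is handled, to keep the example short.

```php
<?php
#[AllowDynamicProperties]
class Cloneable_Document_Sketch extends DOMDocument {

	public function __get( $name ) {
		if ( 'head' === $name ) {
			// Cache the element as a dynamic property on first access.
			$this->head = $this->getElementsByTagName( 'head' )->item( 0 );
			return $this->head;
		}
		return null;
	}

	public function __clone() {
		// Drop the cached reference, which still points into the original document.
		unset( $this->head );
	}
}

$original = new Cloneable_Document_Sketch();
$original->loadHTML( '<html><head><title>Example</title></head><body></body></html>' );
$original->head; // Populate the cache before cloning.

$copy = clone $original;
$copy->head->setAttribute( 'data-copy', '1' );

$original_untouched = ! $original->head->hasAttribute( 'data-copy' );
$copy_modified      = $copy->head->hasAttribute( 'data-copy' );
```

Without the `unset()` in `__clone()`, the copy would keep the cached element from the original document, so the attribute would land on the wrong tree.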

Co-Authored-By: Weston Ruter <westonruter@google.com>

schlessera · 2019-12-19T06:11:19Z

The <meta charset="utf-8"> is not strictly necessary according to the AMP spec, but I think it is a good practice to just always have it around, especially as we now might have a different charset for the AMP page than for the non-AMP page.

Actually, it is required according to the spec:

Ah, yes, I mixed stuff up in my head. In terms of parsing the DOMDocument, it is not required (as it doesn't work properly, actually, and DOMDocument needs an HTML4 http-equiv) and adding it would normally happen in sanitizers.

But, still, I think it is good to have here, and if we agree it is worth changing the existing tests for that, then it's settled.

schlessera · 2019-12-19T06:13:26Z

Cloning the Dom\Document can lead to such edge cases, so this is also something to use with caution until we were able to play around more with it.

Wouldn't this be addressed by implementing __clone()? For example:
function __clone() {
    unset( $this->xpath, $this->head, $this->body );
}
This will reset those properties so that the next time the getter is invoked, they will get re-populated.

I thought about that but was wondering how far that rabbit-hole might take me if the DOMDocument does by itself already cause issues. But I can try adding this and reverting the test change to see what we'll get.

schlessera · 2019-12-19T08:46:38Z

@westonruter All of your feedback from the review should be addressed now. Also, I got rid of the dependency on AMP_DOM_Utils in Dom\Document, as I think that doesn't make sense and creates a circular dependency.

src/Dom/Document.php

tests/php/test-class-amp-dom-document.php

schlessera · 2019-12-19T14:34:45Z

Note: I force-pushed because I messed up a commit.

westonruter · 2019-12-19T15:31:29Z

Great job on this large effort.

westonruter · 2020-01-14T03:11:17Z

src/Dom/Document.php

+	const HTML_STRUCTURE_DOCTYPE_PATTERN = '/^[^<]*<!doctype(?:\s+[^>]+)?>/i';
+	const HTML_STRUCTURE_HTML_START_TAG  = '/^[^<]*(?<html_start><html(?:\s+[^>]*)?>)/i';
+	const HTML_STRUCTURE_HTML_END_TAG    = '/(?:<\/html(?:\s+[^>]*)?>)[^<>]*$/i';
+	const HTML_STRUCTURE_HEAD_START_TAG  = '/^[^<]*(?:<head(?:\s+[^>]*)?>)/i';
+	const HTML_STRUCTURE_BODY_START_TAG  = '/^[^<]*(?:<body(?:\s+[^>]*)?>)/i';
+	const HTML_STRUCTURE_BODY_END_TAG    = '/(?:<\/body(?:\s+[^>]*)?>)[^<>]*$/i';
+	const HTML_STRUCTURE_HEAD_TAG        = '/^(?:[^<]*(?:<head(?:\s+[^>]*)?>).*?<\/head(?:\s+[^>]*)?>)/is';


Not all of these patterns account for HTML comments, which are added when doing validation requests. As a result, HTML documents are corrupted when validating. See #4104.
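A small sketch of the kind of mismatch described here, using the doctype pattern quoted above. The placement of the comment before the doctype and the `$comment_aware_pattern` variant are assumptions for illustration; they are not taken from #4104 or from the eventual fix.

```php
<?php
// Pattern as quoted from src/Dom/Document.php in this PR.
$doctype_pattern = '/^[^<]*<!doctype(?:\s+[^>]+)?>/i';

// Hypothetical variant: skip any number of leading HTML comments first.
$comment_aware_pattern = '/^(?:[^<]*<!--.*?-->)*[^<]*<!doctype(?:\s+[^>]+)?>/is';

$plain     = "<!DOCTYPE html>\n<html><head></head><body></body></html>";
$commented = "<!-- validation marker -->\n" . $plain;

// The original pattern matches only when nothing tag-like precedes the doctype.
printf( "original, plain:      %d\n", preg_match( $doctype_pattern, $plain ) );
printf( "original, commented:  %d\n", preg_match( $doctype_pattern, $commented ) );

// The comment-aware variant matches in both cases.
printf( "variant,  plain:      %d\n", preg_match( $comment_aware_pattern, $plain ) );
printf( "variant,  commented:  %d\n", preg_match( $comment_aware_pattern, $commented ) );
```

When the doctype is not detected, a normalization step that then adds its own doctype or wrapper tags could end up duplicating structure, which is one plausible way for a document to become corrupted.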

westonruter · 2020-04-01T20:37:03Z

includes/sanitizers/class-amp-meta-sanitizer.php

+	 * Sanitize.
+	 */
+	public function sanitize() {
+		$elements = $this->dom->getElementsByTagName( static::$tag );


This caused a bug. See #4502.

googlebot added the cla: yes label Nov 15, 2019

schlessera changed the title ~~## Summary~~ Add meta tag sanitizer to deal with http-equiv charsets Nov 15, 2019

schlessera changed the title ~~Add meta tag sanitizer to deal with http-equiv charsets~~ [WIP] Add meta tag sanitizer to deal with http-equiv charsets Nov 15, 2019

schlessera added the Sanitizers and Bug labels Nov 15, 2019

schlessera mentioned this pull request Nov 15, 2019

Refactor AMP_Theme_Support::ensure_required_markup() into sanitizers #3763

Closed

8 tasks

schlessera changed the title ~~[WIP] Add meta tag sanitizer to deal with http-equiv charsets~~ [WIP] Add meta tag sanitizer Nov 18, 2019

westonruter reviewed Nov 19, 2019

westonruter reviewed Nov 21, 2019

includes/amp-helper-functions.php (resolved)

westonruter reviewed Nov 22, 2019

westonruter mentioned this pull request Nov 26, 2019

Improve validating sanitizer with context for why element/attribute is invalid #3780

Merged

3 tasks

schlessera changed the title ~~[WIP] Add meta tag sanitizer~~ [WIP] Add AMP_DOM_Document & meta tag sanitizer Nov 27, 2019

westonruter reviewed Nov 28, 2019

schlessera force-pushed the fix/3469-convert-http-equiv branch from 7513e41 to ed2ef59 Compare December 5, 2019 09:08

schlessera mentioned this pull request Dec 5, 2019

Prevent wpautop() from modifying Twitter embed #3874

Merged

3 tasks

schlessera force-pushed the fix/3469-convert-http-equiv branch from bbf08a3 to 4bcad65 Compare December 9, 2019 17:08

westonruter requested changes Dec 18, 2019

Apply suggestions from code review

fbd790f

Co-Authored-By: Weston Ruter <westonruter@google.com>

schlessera added 7 commits December 19, 2019 07:49

Move is_valid_head_node() into Dom\Document

41094b4

Return both content & encoding via array in detect_and_strip_encoding()

6ff286c

Optimize sanitize_encoding()

e94e3ca

Actually use charset in document test

c5415d2

Always use imported relative class name Document in @covers annotations

7c037a9

Reset internal optimizations on clone

8abb927

Avoid depending on AMP_DOM_Utils in Dom\Document

a214a25

westonruter reviewed Dec 19, 2019

src/Dom/Document.php (outdated, resolved)

tests/php/test-class-amp-dom-document.php (outdated, resolved)

schlessera added 2 commits December 19, 2019 15:32

Remove double line break

46af79f

Make reset() private

c739df0

schlessera force-pushed the fix/3469-convert-http-equiv branch from b70ea78 to c739df0 Compare December 19, 2019 14:33

westonruter approved these changes Dec 19, 2019

westonruter merged commit 8566e42 into develop Dec 19, 2019

westonruter deleted the fix/3469-convert-http-equiv branch December 19, 2019 15:32

westonruter added this to the v1.5 milestone Dec 19, 2019

westonruter mentioned this pull request Dec 19, 2019

Add support for amp-bind #895

Merged

4 tasks

This was referenced Jan 12, 2020

Add validation of individual properties in meta content attributes #4070

Closed

Validation broken in Genesis themes by Document::normalize_document_structure() #4104

Closed

westonruter reviewed Jan 14, 2020

westonruter added the Changelogged label Mar 23, 2020

westonruter reviewed Apr 1, 2020


		$this->ensure_charset_is_present( $charset );

		if ( ! $this->is_correct_charset() ) { // phpcs:ignore Generic.CodeAnalysis.EmptyStatement

	/*
	 * Abort if the response was not HTML. To be post-processed as an AMP page, the output-buffered document must
	 * have the HTML mime type and it must start with <html> followed by <head> tag (with whitespace, doctype, and comments optionally interspersed).
	 */
	if ( 'text/html' !== substr( AMP_HTTP::get_response_content_type(), 0, 9 ) || ! preg_match( '#^(?:<!.*?>|\s+)*<html.*?>(?:<!.*?>|\s+)*<head\b(.*?)>#is', $response ) ) {
		return $response;
	}

	/**
	 * Check if attribute has valid properties.
	 *
	 * @since 0.7
	 *
	 * @param DOMElement       $node           Node.
	 * @param string           $attr_name      Attribute name.
	 * @param array[]|string[] $attr_spec_rule Attribute spec rule.
	 *
	 * @return string:
	 *      - AMP_Rule_Spec::PASS - $attr_name has a value that matches the rule.
	 *      - AMP_Rule_Spec::FAIL - $attr_name has a value that does not match rule.
	 *      - AMP_Rule_Spec::NOT_APPLICABLE - $attr_name does not exist or there
	 *                                        is no rule for this attribute.
	 */
	private function check_attr_spec_rule_value_properties( DOMElement $node, $attr_name, $attr_spec_rule ) {
		if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) && $node->hasAttribute( $attr_name ) ) {
			$properties = [];
			foreach ( explode( ',', $node->getAttribute( $attr_name ) ) as $pair ) {
				$pair_parts = explode( '=', $pair, 2 );
				if ( 2 !== count( $pair_parts ) ) {
					return 0;
				}
				$properties[ strtolower( trim( $pair_parts[0] ) ) ] = trim( $pair_parts[1] );
			}

			// Fail if there are unrecognized properties.
			if ( count( array_diff( array_keys( $properties ), array_keys( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] ) ) ) > 0 ) {
				return AMP_Rule_Spec::FAIL;
			}

			foreach ( $attr_spec_rule[ AMP_Rule_Spec::VALUE_PROPERTIES ] as $prop_name => $property_spec ) {

				// Mandatory property is missing.
				if ( ! empty( $property_spec['mandatory'] ) && ! isset( $properties[ $prop_name ] ) ) {
					return AMP_Rule_Spec::FAIL;
				}

				if ( ! isset( $properties[ $prop_name ] ) ) {
					continue;
				}

				$prop_value = $properties[ $prop_name ];

				// Required value is absent, so fail.
				$required_value = null;
				if ( isset( $property_spec['value'] ) ) {
					$required_value = $property_spec['value'];
				} elseif ( isset( $property_spec['value_double'] ) ) {
					$required_value = $property_spec['value_double'];
					$prop_value     = (float) $prop_value;
				}
				if ( isset( $required_value ) && $prop_value !== $required_value ) {
					return AMP_Rule_Spec::FAIL;
				}
			}
			return AMP_Rule_Spec::PASS;
		}
		return AMP_Rule_Spec::NOT_APPLICABLE;
	}

		$http_equiv_tag && str_replace( $http_equiv_tag, '', $content );
		$charset_tag && str_replace( $charset_tag, '', $content );

	* @property DOMXpath $xpath XPath query object for this document.
	* @property DOMXPath $xpath XPath query object for this document.

	trigger_error( "Undefined property: AMP_DOM_Document::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput
	trigger_error( "Undefined property: " . __CLASS__ . "::${$name}", E_NOTICE ); // phpcs:ignore WordPress.PHP.DevelopmentFunctions,WordPress.Security.EscapeOutput

	/**
	* Normalize the document structure.
	*
	* This makes sure the document adheres to the general structure that AMP requires:
	* ```
	* <!doctype html>
	* <html>
	* <head>
	* <meta charset="utf-8">
	* </head>
	* <body>
	* </body>
	* </html>
	* ```
	*
	* @param string $content Content to normalize the structure of.
	* @return string Normalized content.
	*/
	private function normalize_document_structure( $content ) {
