New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent schema.org duplicates #992

Merged
merged 4 commits into from Mar 13, 2018

Conversation

Projects
None yet
2 participants
@oscarssanchez
Copy link
Contributor

oscarssanchez commented Mar 8, 2018

Request For Review

Hi @westonruter,

Could you use this PR, which addresses #962?

Originally, I used json_decode( $script->nodeValue ) to convert the schema.org json data to an array. However, that brought some issues with get_mustache_tag_placeholders(), as it adds some extra markup to the schema.org script tag.

I opted to use preg_match() as I saw the schema.org URL sometimes can either start with http or https.

Fixes #962.

@westonruter
Copy link
Member

westonruter left a comment

Thanks for the PR!

@@ -14,7 +14,6 @@ function amp_post_template_init_hooks() {
add_action( 'amp_post_template_head', 'amp_post_template_add_scripts' );
add_action( 'amp_post_template_head', 'amp_post_template_add_fonts' );
add_action( 'amp_post_template_head', 'amp_post_template_add_boilerplate_css' );
add_action( 'amp_post_template_head', 'amp_print_schemaorg_metadata' );

This comment has been minimized.

@westonruter

westonruter Mar 8, 2018

Member

This should not be removed. The legacy post templates should remain as-is, I believe. The ensure_required_markup method below does not apply for the legacy post template actions.

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 9, 2018

Contributor

Hi @westonruter

Thanks for your feedback. I corrected that mistake and restored the line.

* @param boolean $expected The expected result.
*/
public function test_schema_org_present( $script, $expected ) {
$page = '<html><head><script>%s</script></head><body>Test</body></html>';

This comment has been minimized.

@westonruter

westonruter Mar 8, 2018

Member

The <script> here is missing the type="application/ld+json".

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 9, 2018

Contributor

Very good observation, I added the type attribute on the new tests for ensure_required_markup()

public static function schema_org_present( DOMDocument $dom ) {
$head = $dom->getElementsByTagName( 'head' )->item( 0 );
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 1 === preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {

This comment has been minimized.

@westonruter

westonruter Mar 8, 2018

Member

Before looking at the content of the element ($script->nodeValue) I think it should first check 'application/ld+json' === $script->getAttribute( 'type' ).

Also, I think the 1 === should be removed. As long as preg_match() returns a truthy value that's all we really care about.

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 9, 2018

Contributor

Removed the integer comparison and added your suggestion, indeed it makes it more precise.

* @param DOMDocument $dom Representation of the document.
* @return bool
*/
public static function schema_org_present( DOMDocument $dom ) {

This comment has been minimized.

@westonruter

westonruter Mar 8, 2018

Member

Let's rename this to is_schema_org_metadata_present. But actually, I think maybe the logic should just be inlined in ensure_required_markup like the other checks are done. True that it may be better design to break up the logic into separate methods, but since the existing logic is not in methods then I don't know if we should split out this one separately.

If you changed ensure_required_markup from protected to be public and then added the logic from schema_org_present to it, along with unit tests for ensure_required_markup, then that would be great. Later we may very well want to break up ensure_required_markup into smaller methods, but I think that would probably be done along with breaking up the AMP_Theme_Support class into separate classes.

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 9, 2018

Contributor

I added your suggestions.

I also did the test for ensure_required_markup, however, I only made tests for the schema.org script part. Could you use tests for the entire function?

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

Yes, this method needs more tests, so if that's something you want to add that would be great.

@@ -829,6 +828,13 @@ protected static function ensure_required_markup( DOMDocument $dom ) {
) );
$head->insertBefore( $meta_viewport, $meta_charset->nextSibling );
}
if ( ! self::schema_org_present( $dom ) ) {

This comment has been minimized.

@westonruter

westonruter Mar 8, 2018

Member

As noted below, I think this logic should be inlined (for now) as it is done above for $meta_charset and $meta_viewport and $rel_canonical below.

@westonruter westonruter added this to the v0.7 milestone Mar 8, 2018

Prevent schema.org duplicates
Check if schema.org script is present.
Add schema.org only if it doesn't exist.
// Prevent schema.org duplicates.
$schema_org_script = null;
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

The regular expression here could be brittle. There's no guarantee that the @context will be the first key, and there's no guarantee there wouldn't be whitespace between the strings. I think doing a simple false !== strpos( $script->nodeValue, 'schema.org' ) could be sufficient here.

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

We may not even need to look at the value. Just checking if the type is application/ld+json may be just fine.

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 11, 2018

Contributor

Hi @westonruter

I opted to search for both the attribute and the 'schema.org' substring as you mentioned it should check for any JSON-LD script in the head which mentions 'schema.org' in any form.

$schema_org_script = null;
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {
$schema_org_script = $script->nodeValue;

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

Since the value isn't being used, it may be better to just store a flag like $has_schema_org_metadata = true.

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 11, 2018

Contributor

Thanks, it is true, I made the corresponding correction.

}
if ( ! $schema_org_script ) {
$script = $dom->createElement( 'script', wp_json_encode( amp_get_schemaorg_metadata() ) );
AMP_DOM_Utils::add_attributes_to_node( $script, array(

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

Since you're using $dom->createElement() here then I think it makes more sense to just do $script->setAttribute( 'type', 'application/ld+json' )

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 11, 2018

Contributor

Also true. Corrected it.

$dom = new DOMDocument();
$dom->loadHTML( $page );
AMP_Theme_Support::ensure_required_markup( $dom );
$this->assertEquals( substr_count( $dom->saveHTML(), $schema_test_value ), 1 );

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

Also should add a test for when JSON is output but the slashes are not escaped.

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

And add a test for when @context is not the first key.

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 11, 2018

Contributor

I made tests for when @ anothercontext is the first key and has schema.org present, I hope I understood correctly.

@westonruter

This comment has been minimized.

Copy link
Member

westonruter commented Mar 9, 2018

One more higher-level question: when I first raised this issue I was thinking that there could only be one such metadata script in the document. However, apparently it's fine to have multiple: https://stackoverflow.com/questions/30723531/best-json-ld-practices-using-multiple-script-elements

The main concern here for the AMP plugin is to allow for other plugins to take control of the generation of the Schema.org metadata instead of the defaults that the AMP plugin outputs, but then to have the AMP plugin output a good fallback when others don't do so.

I'm feeling like if there is any JSON-LD script in the head (which mentions schema.org) then we can take this as the signal to just skip outputting the default metadata from the plugin.

$schema_org_script = null;
foreach ( $head->getElementsByTagName( 'script' ) as $script ) {
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) {
$schema_org_script = $script->nodeValue;

This comment has been minimized.

@westonruter

westonruter Mar 9, 2018

Member

Why can't this do json_decode( $script->nodeValue, false ) and check if wp_parse_url( $script['@context'], PHP_URL_HOST ) === 'schema.org?

This comment has been minimized.

@oscarssanchez

oscarssanchez Mar 11, 2018

Contributor

Hi @westonruter,

Trying to use wp_parse_url with json_decode( $script->nodeValue, false ) breaks the site, but it works fine when $script->nodeValue is converted to an array. The logic however stops working.

I've found that if you comment out these lines in class-amp-dom-utils.php, the logic works again:

	$placeholders = self::get_mustache_tag_placeholders();
	$document     = str_replace(
		array_keys( $placeholders ),
		array_values( $placeholders ),
		$document
	);

I opted to keep your previous suggestions since i don't know how important are those lines.

This comment has been minimized.

@westonruter

westonruter Mar 13, 2018

Member

OK, I think this makes sense. If I have some JSON that looks like:

{"foo": {"bar":}}

The last }} there is going to get replaced with a placeholder (since this token is returned by get_mustache_tag_placeholders) before parsing the DOM, to then get put back in AMP_DOM_Utils::get_content_from_dom_node(). That means we cannot parse JSON from the parsed DOM currently, which is not ideal. To fix this we'd need to limit the replacements to only happen inside of href and src attributes only, using something like this:

diff --git a/includes/utils/class-amp-dom-utils.php b/includes/utils/class-amp-dom-utils.php
index 5288c04..f20fba1 100644
--- a/includes/utils/class-amp-dom-utils.php
+++ b/includes/utils/class-amp-dom-utils.php
@@ -77,9 +77,15 @@ class AMP_DOM_Utils {
 		 * elements since it is faster and wouldn't change the outcome.
 		 */
 		$placeholders = self::get_mustache_tag_placeholders();
-		$document     = str_replace(
-			array_keys( $placeholders ),
-			array_values( $placeholders ),
+		$document     = preg_replace_callback(
+			'#\s(src|href)=("[^"]+"|\'[^\']+\')#s',
+			function( $matches ) use ( $placeholders ) {
+				return str_replace(
+					array_keys( $placeholders ),
+					array_values( $placeholders ),
+					$matches[0]
+				);
+			},
 			$document
 		);

That would probably be something good to add here to ensure we can read JSON out of the DOM.

This comment has been minimized.

@westonruter

westonruter Mar 14, 2018

Member

I've merged the fix for this in #1019 via 4a10ee4

@oscarssanchez oscarssanchez force-pushed the oscarssanchez:add/962-prevent-schema-duplicates branch from 8b96d77 to c66bfc4 Mar 11, 2018

oscarssanchez and others added some commits Mar 11, 2018

Schema.org prevent duplicates
Logic Refactoring

@westonruter westonruter merged commit ab0ae8a into ampproject:develop Mar 13, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment