Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added host validation for urls #983

Merged
merged 18 commits into from Mar 8, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 2 additions & 2 deletions .travis.yml
Expand Up @@ -43,9 +43,9 @@ cache:
matrix:
include:
- php: "5.3"
env: WP_VERSION=latest DEV_LIB_SKIP=phpcs
env: WP_VERSION=latest DEV_LIB_SKIP=composer,phpcs
- php: "5.3"
env: WP_VERSION=4.7 DEV_LIB_SKIP=phpcs
env: WP_VERSION=4.7 DEV_LIB_SKIP=composer,phpcs
- php: "5.4"
env: WP_VERSION=latest
- php: "5.4"
Expand Down
2 changes: 1 addition & 1 deletion dev-lib
73 changes: 61 additions & 12 deletions includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php
Expand Up @@ -699,6 +699,19 @@ private function validate_attr_spec_list_for_node( $node, $attr_spec_list ) {
}
}

/*
* If given attribute's value is a URL with a host, the host must
* be valid
*/
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) ) {
$result = $this->check_attr_spec_rule_valid_url( $node, $attr_name, $attr_spec_rule );
if ( AMP_Rule_Spec::PASS === $result ) {
$score++;
} elseif ( AMP_Rule_Spec::FAIL === $result ) {
return 0;
}
}

/*
* If the given attribute's value is *not* a relative path, and the rule's
* value is `false`, then pass.
Expand Down Expand Up @@ -877,6 +890,9 @@ private function delegated_sanitize_disallowed_attribute_values_in_node( $node,
} elseif ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOWED_PROTOCOL ] ) &&
AMP_Rule_Spec::FAIL === $this->check_attr_spec_rule_allowed_protocol( $node, $attr_name, $attr_spec_rule ) ) {
$should_remove_node = true;
} elseif ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) &&
AMP_Rule_Spec::FAIL === $this->check_attr_spec_rule_valid_url( $node, $attr_name, $attr_spec_rule ) ) {
$should_remove_node = true;
} elseif ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOW_RELATIVE ] ) &&
AMP_Rule_Spec::FAIL === $this->check_attr_spec_rule_disallowed_relative( $node, $attr_name, $attr_spec_rule ) ) {
$should_remove_node = true;
Expand Down Expand Up @@ -1122,6 +1138,47 @@ private function check_attr_spec_rule_value_regex_casei( $node, $attr_name, $att
return AMP_Rule_Spec::NOT_APPLICABLE;
}

/**
* Check if attribute has a valid host value
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a @since 0.7 here

* @since 0.7
*
* @param DOMElement $node Node.
* @param string $attr_name Attribute name.
* @param array[]|string[] $attr_spec_rule Attribute spec rule.
*
* @return string:
* - AMP_Rule_Spec::PASS - $attr_name has a value that matches the rule.
* - AMP_Rule_Spec::FAIL - $attr_name has a value that does *not* match rule.
* - AMP_Rule_Spec::NOT_APPLICABLE - $attr_name does not exist or there
* is no rule for this attribute.
*/
private function check_attr_spec_rule_valid_url( $node, $attr_name, $attr_spec_rule ) {
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) ) {
if ( $node->hasAttribute( $attr_name ) ) {
$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $attr_name ) );
foreach ( $urls_to_test as $url ) {
$url = urldecode( $url );
// Check if the host contains invalid chars.
$url_host = wp_parse_url( $url, PHP_URL_HOST );
if ( $url_host && preg_match( '/[!"#$%&\'()*+,\/:;<=>?@[\]^`{|}~\s]/i', $url_host ) ) {
return AMP_Rule_Spec::FAIL;
}

// Check if the protocol contains invalid chars.
$dots_pos = strpos( $url, ':' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be using $scheme = wp_parse_url( $url, PHP_URL_SCHEME ) to obtain the schema instead of ':'? What if a URL contains a port, then this could match in the case of a scheme-less URL like //example.com:80/foo/ or (more likely) the URL contains a colon in the path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, you could omit PHP_URL_HOST param above and then just check for $parsed_url['scheme'].

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that mail to:foo@bar.com won't be correctly parsed by wp_parse_url, the scheme is null.
If the url contains a port number the code will check that no invalid chars are being used before the port number which will pass if the url is correct.

if ( false !== $dots_pos && preg_match( '/[!"#$%&\'()*+,\/:;<=>?@[\]^`{|}~\s]/i', substr( $url, 0, $dots_pos ) ) ) {
return AMP_Rule_Spec::FAIL;
}
}

return AMP_Rule_Spec::PASS;
}
}

return AMP_Rule_Spec::NOT_APPLICABLE;
}

/**
* Check if attribute has a protocol value rule determine if it matches.
*
Expand All @@ -1138,9 +1195,7 @@ private function check_attr_spec_rule_value_regex_casei( $node, $attr_name, $att
private function check_attr_spec_rule_allowed_protocol( $node, $attr_name, $attr_spec_rule ) {
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOWED_PROTOCOL ] ) ) {
if ( $node->hasAttribute( $attr_name ) ) {
$attr_value = $node->getAttribute( $attr_name );
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value );
$urls_to_test = explode( ',', $attr_value );
$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $attr_name ) );
foreach ( $urls_to_test as $url ) {
/*
* This seems to be an acceptable check since the AMP validator
Expand All @@ -1157,9 +1212,7 @@ private function check_attr_spec_rule_allowed_protocol( $node, $attr_name, $attr
} elseif ( isset( $attr_spec_rule[ AMP_Rule_Spec::ALTERNATIVE_NAMES ] ) ) {
foreach ( $attr_spec_rule[ AMP_Rule_Spec::ALTERNATIVE_NAMES ] as $alternative_name ) {
if ( $node->hasAttribute( $alternative_name ) ) {
$attr_value = $node->getAttribute( $alternative_name );
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value );
$urls_to_test = explode( ',', $attr_value );
$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $alternative_name ) );
foreach ( $urls_to_test as $url ) {
/*
* This seems to be an acceptable check since the AMP validator
Expand Down Expand Up @@ -1196,9 +1249,7 @@ private function check_attr_spec_rule_allowed_protocol( $node, $attr_name, $attr
private function check_attr_spec_rule_disallowed_relative( $node, $attr_name, $attr_spec_rule ) {
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOW_RELATIVE ] ) && ! ( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOW_RELATIVE ] ) ) {
if ( $node->hasAttribute( $attr_name ) ) {
$attr_value = $node->getAttribute( $attr_name );
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value );
$urls_to_test = explode( ',', $attr_value );
$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $attr_name ) );
foreach ( $urls_to_test as $url ) {
$parsed_url = AMP_WP_Utils::parse_url( $url );

Expand All @@ -1217,9 +1268,7 @@ private function check_attr_spec_rule_disallowed_relative( $node, $attr_name, $a
} elseif ( isset( $attr_spec_rule[ AMP_Rule_Spec::ALTERNATIVE_NAMES ] ) ) {
foreach ( $attr_spec_rule[ AMP_Rule_Spec::ALTERNATIVE_NAMES ] as $alternative_name ) {
if ( $node->hasAttribute( $alternative_name ) ) {
$attr_value = $node->getAttribute( $alternative_name );
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value );
$urls_to_test = explode( ',', $attr_value );
$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $alternative_name ) );
foreach ( $urls_to_test as $url ) {
$parsed_url = AMP_WP_Utils::parse_url( $url );
if ( empty( $parsed_url['scheme'] ) ) {
Expand Down
1 change: 1 addition & 0 deletions readme.md
Expand Up @@ -64,6 +64,7 @@ Follow along with or [contribute](https://github.com/Automattic/amp-wp/blob/deve
- Creation of AMP-related notifications, on entering invalid content in the 'classic' editor. See [#912](https://github.com/Automattic/amp-wp/pull/912/). Props kienstra, westonruter, ThierryA.
- Add output buffering, ensuring the entire page is valid AMP. See [#929](https://github.com/Automattic/amp-wp/pull/929), [#857](https://github.com/Automattic/amp-wp/pull/857). Props westonruter, ThierryA.
- Update the generated sanitizer file to the AMP spec, and simplify the file that generates it. See [#929](https://github.com/Automattic/amp-wp/pull/929). Props westonruter.
- Add validation of host names in URLs. See [#983](https://github.com/Automattic/amp-wp/pull/983). Props rubengonzalezmrf.

See [0.7 milestone](https://github.com/Automattic/amp-wp/milestone/6?closed=1).

Expand Down
1 change: 1 addition & 0 deletions readme.txt
Expand Up @@ -47,6 +47,7 @@ Follow along with or [contribute](https://github.com/Automattic/amp-wp/blob/deve
- Creation of AMP-related notifications, on entering invalid content in the 'classic' editor. See [#912](https://github.com/Automattic/amp-wp/pull/912/). Props kienstra, westonruter, ThierryA.
- Add output buffering, ensuring the entire page is valid AMP. See [#929](https://github.com/Automattic/amp-wp/pull/929), [#857](https://github.com/Automattic/amp-wp/pull/857). Props westonruter, ThierryA.
- Update the generated sanitizer file to the AMP spec, and simplify the file that generates it. See [#929](https://github.com/Automattic/amp-wp/pull/929). Props westonruter.
- Add validation of host names in URLs. See [#983](https://github.com/Automattic/amp-wp/pull/983). Props rubengonzalezmrf.

See [0.7 milestone](https://github.com/Automattic/amp-wp/milestone/6?closed=1).

Expand Down
Expand Up @@ -996,7 +996,7 @@ public function get_validate_attr_spec_list_for_node_data() {
),
),
),
1,
2,
),
'attributes_allow_relative_false_fail' => array(
array(
Expand Down Expand Up @@ -1024,7 +1024,7 @@ public function get_validate_attr_spec_list_for_node_data() {
),
),
),
1,
2,
),
'attributes_allow_empty_false_fail' => array(
array(
Expand Down
17 changes: 17 additions & 0 deletions tests/test-tag-and-attribute-sanitizer.php
Expand Up @@ -680,6 +680,23 @@ public function get_body_data() {
'<a class="foo" href="">value</a>',
),

'a_with_wrong_host' => array(
'<a class="foo" href="http://foo bar">value</a>',
'<a class="foo" href="">value</a>',
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should add a test here for scheme-less URLs, like: <a class="foo" href="//bad domain with a space.com/foo">value</a>

'a_with_encoded_host' => array(
'<a class="foo" href="http://%65%78%61%6d%70%6c%65%2e%63%6f%6d/foo/">value</a>',
null,
),
'a_with_wrong_schemeless_host' => array(
'<a class="foo" href="//bad domain with a space.com/foo">value</a>',
'<a class="foo" href="">value</a>',
),
'a_with_mail_host' => array(
'<a class="foo" href="mail to:foo@bar.com">value</a>',
'<a class="foo" href="">value</a>',
),

// font is removed so we should check that other elements are checked as well.
'font_with_other_bad_elements' => array(
'<font size="1">Headline</font><span style="color: blue">Span</span>',
Expand Down