Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added host validation for urls #983

Merged
merged 18 commits into from Mar 8, 2018

Conversation

Projects
None yet
2 participants
@rubengonzalezmrf
Copy link

commented Mar 1, 2018

Fixes #977.

Rubén

Rubén added some commits Mar 1, 2018

Rubén
Rubén
@westonruter
Copy link
Member

left a comment

Thanks for the PR. A few changes needed.

Also, doesn't look like the unit test assertions are working: https://travis-ci.org/Automattic/amp-wp/jobs/347682592#L279

'<a class="foo" href="http://foo bar">value</a>',
'<a class="foo" href="">value</a>',
),

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 1, 2018

Member

Please add test cases for examples found in #977 (comment)

@@ -28,6 +28,7 @@ class AMP_Allowed_Tags_Generated {
'value_url' => array(
'allow_empty' => true,
'allow_relative' => true,
'valid_host' => true,

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 1, 2018

Member

This isn't going to work since this file is generated from the spec via amphtml-update.py, so your change here will be lost.

* If given attribute's value is a URL with a host, the host must
* be valid
*/
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::VALID_HOST ] ) ) {

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 1, 2018

Member

Instead of introducing AMP_Rule_Spec::VALID_HOST here, which isn't actually part of the spec, I suggest you validate the host whenever isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ). Should it not be validated for every attribute that is a URL value?

@@ -1122,6 +1138,40 @@ private function check_attr_spec_rule_value_regex_casei( $node, $attr_name, $att
return AMP_Rule_Spec::NOT_APPLICABLE;
}
/**
* Check if attribute has a valid host value
*

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 1, 2018

Member

Please add a @since 0.7 here

$urls_to_test = explode( ',', $attr_value );
foreach ( $urls_to_test as $url ) {
$url_host = AMP_WP_Utils::parse_url( urldecode( $url ), PHP_URL_HOST );

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 1, 2018

Member

As noted in #876, AMP_WP_Utils is obsolete and you can use wp_parse_url() directly.

foreach ( $urls_to_test as $url ) {
$url_host = AMP_WP_Utils::parse_url( urldecode( $url ), PHP_URL_HOST );
if ( $url_host && preg_match( '/[!"#$%&\'()*+,\/:;<=>?@[\]^`{|}~\s]/i', $url_host ) ) {

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 1, 2018

Member

I should think that a whitelist should be used rather than a blacklist here.

This comment has been minimized.

Copy link
@rubengonzalezmrf

rubengonzalezmrf Mar 2, 2018

Author

I tend to think like you, but the amp validator is doing it this way, looking for blacklisted chars, so i thought it would be better to follow the same approach.

This comment has been minimized.

Copy link
@westonruter

Rubén added some commits Mar 2, 2018

Rubén
Rubén
Rubén
Rubén
Rubén
@westonruter

This comment has been minimized.

Copy link
Member

commented Mar 6, 2018

I'm getting the same failures locally.

@westonruter

This comment has been minimized.

Copy link
Member

commented Mar 6, 2018

If I add post_content that includes:

<a class="foo" href="http://foo bar">value</a>

<a class="foo" href="mail to:foo@bar.com">value</a>

I get out:

<p><a class="foo" href="">value</a></p>

<p><a class="foo" href="mail%20to:foo@bar.com">value</a></p>

So it seems the mail to check is failing.

}
// Check if the protocol contains invalid chars.
$url_scheme = AMP_WP_Utils::parse_url( $url, PHP_URL_SCHEME );

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 6, 2018

Member

Use wp_parse_url() instead of AMP_WP_Utils::parse_url() here.

if ( $node->hasAttribute( $attr_name ) ) {
$attr_value = $node->getAttribute( $attr_name );
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value );
$urls_to_test = explode( ',', $attr_value );

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 6, 2018

Member

I'm curious about the comma-separated list of URLs. Naturally this shouldn't be allowed in an href or src. What elements do this?

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 6, 2018

Member

You can also combine these two lines into:

$urls_to_test = preg_split( '/\s*,\s*/', $attr_value );

This comment has been minimized.

Copy link
@rubengonzalezmrf

rubengonzalezmrf Mar 7, 2018

Author

an image with a srcset attribute would have a list of comma separated urls, so i thought it could be a good idea to check all of them

Rubén
@rubengonzalezmrf

This comment has been minimized.

Copy link
Author

commented Mar 7, 2018

I have post content now with the same 2 examples you added, and i get out the expected results. But unit tests fails anyway

if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) ) {
if ( $node->hasAttribute( $attr_name ) ) {
$attr_value = $node->getAttribute( $attr_name );
$urls_to_test = preg_split( '/\s*,\s*/', $attr_value );

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 7, 2018

Member

Since only srcset has comma-separated list of URLs, maybe this should check of 'srcset' === $attr_name for whether to split. Maybe that's overkill. My only concern would be if a non-srcset URL happening to have a comma in it and that this would result in the attribute being invalid.

The alternative could be to do this:

$urls_to_test = preg_split( '#\s*,\s*(?=https?://)#', $attr_value );

That will ensure that it will only split where a comma is followed by another URL.

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 8, 2018

Member

Ah, I see that check_attr_spec_rule_allowed_protocol and check_attr_spec_rule_disallowed_relative do simply comma-separated splits. So probably not something to worry about here then.

@westonruter

This comment has been minimized.

Copy link
Member

commented Mar 8, 2018

But unit tests fails anyway

The tests may not be failing. The jobs are failing due to a PHPCS error:

includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php:1171:36: error - Use Yoda Condition checks, you must. (WordPress.PHP.YodaConditions.NotYoda)

https://travis-ci.org/Automattic/amp-wp/jobs/350210562#L216

Simplify splitting a string by commas
* Fix phpcs yoda error.
* Fix phpunit failures in AMP_Tag_And_Attribute_Sanitizer_Attr_Spec_Rules_Test.
'a_with_wrong_host' => array(
'<a class="foo" href="http://foo bar">value</a>',
'<a class="foo" href="">value</a>',
),

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 8, 2018

Member

Probably should add a test here for scheme-less URLs, like: <a class="foo" href="//bad domain with a space.com/foo">value</a>

}
// Check if the protocol contains invalid chars.
$dots_pos = strpos( $url, ':' );

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 8, 2018

Member

Should this not be using $scheme = wp_parse_url( $url, PHP_URL_SCHEME ) to obtain the schema instead of ':'? What if a URL contains a port, then this could match in the case of a scheme-less URL like //example.com:80/foo/ or (more likely) the URL contains a colon in the path.

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 8, 2018

Member

In this case, you could omit PHP_URL_HOST param above and then just check for $parsed_url['scheme'].

This comment has been minimized.

Copy link
@rubengonzalezmrf

rubengonzalezmrf Mar 8, 2018

Author

The point is that mail to:foo@bar.com won't be correctly parsed by wp_parse_url, the scheme is null.
If the url contains a port number the code will check that no invalid chars are being used before the port number which will pass if the url is correct.

$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $attr_name ) );
foreach ( $urls_to_test as $url ) {
// Check if the host contains invalid chars.
$url_host = wp_parse_url( urldecode( $url ), PHP_URL_HOST );

This comment has been minimized.

Copy link
@westonruter

westonruter Mar 8, 2018

Member

Good use of urldecode() here. If I have a URL like https://%65%78%61%6d%70%6c%65%2e%63%6f%6d/ then this should be valid in AMP, as it is just https://example.com/ 👍

Rubén and others added some commits Mar 8, 2018

@westonruter westonruter added this to the v0.7 milestone Mar 8, 2018

@westonruter westonruter merged commit d0f4f01 into ampproject:develop Mar 8, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.