-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added host validation for urls #983
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. A few changes needed.
Also, doesn't look like the unit test assertions are working: https://travis-ci.org/Automattic/amp-wp/jobs/347682592#L279
'<a class="foo" href="http://foo bar">value</a>', | ||
'<a class="foo" href="">value</a>', | ||
), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add test cases for examples found in #977 (comment)
@@ -28,6 +28,7 @@ class AMP_Allowed_Tags_Generated { | |||
'value_url' => array( | |||
'allow_empty' => true, | |||
'allow_relative' => true, | |||
'valid_host' => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't going to work since this file is generated from the spec via amphtml-update.py
, so your change here will be lost.
* If given attribute's value is a URL with a host, the host must | ||
* be valid | ||
*/ | ||
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::VALID_HOST ] ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing AMP_Rule_Spec::VALID_HOST
here, which isn't actually part of the spec, I suggest you validate the host whenever isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] )
. Should it not be validated for every attribute that is a URL value?
@@ -1122,6 +1138,40 @@ private function check_attr_spec_rule_value_regex_casei( $node, $attr_name, $att | |||
return AMP_Rule_Spec::NOT_APPLICABLE; | |||
} | |||
|
|||
/** | |||
* Check if attribute has a valid host value | |||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a @since 0.7
here
$urls_to_test = explode( ',', $attr_value ); | ||
|
||
foreach ( $urls_to_test as $url ) { | ||
$url_host = AMP_WP_Utils::parse_url( urldecode( $url ), PHP_URL_HOST ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in #876, AMP_WP_Utils
is obsolete and you can use wp_parse_url()
directly.
|
||
foreach ( $urls_to_test as $url ) { | ||
$url_host = AMP_WP_Utils::parse_url( urldecode( $url ), PHP_URL_HOST ); | ||
if ( $url_host && preg_match( '/[!"#$%&\'()*+,\/:;<=>?@[\]^`{|}~\s]/i', $url_host ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should think that a whitelist should be used rather than a blacklist here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to think like you, but the amp validator is doing it this way, looking for blacklisted chars, so i thought it would be better to follow the same approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is supposed to match everything matched by hostCharIsValid()
in https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L69-L103 ?
I'm getting the same failures locally. |
If I add <a class="foo" href="http://foo bar">value</a>
<a class="foo" href="mail to:foo@bar.com">value</a> I get out: <p><a class="foo" href="">value</a></p>
<p><a class="foo" href="mail%20to:foo@bar.com">value</a></p> So it seems the |
} | ||
|
||
// Check if the protocol contains invalid chars. | ||
$url_scheme = AMP_WP_Utils::parse_url( $url, PHP_URL_SCHEME ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use wp_parse_url()
instead of AMP_WP_Utils::parse_url()
here.
if ( $node->hasAttribute( $attr_name ) ) { | ||
$attr_value = $node->getAttribute( $attr_name ); | ||
$attr_value = preg_replace( '/\s*,\s*/', ',', $attr_value ); | ||
$urls_to_test = explode( ',', $attr_value ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the comma-separated list of URLs. Naturally this shouldn't be allowed in an href
or src
. What elements do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also combine these two lines into:
$urls_to_test = preg_split( '/\s*,\s*/', $attr_value );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an image with a srcset attribute would have a list of comma separated urls, so i thought it could be a good idea to check all of them
I have post content now with the same 2 examples you added, and i get out the expected results. But unit tests fails anyway |
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) ) { | ||
if ( $node->hasAttribute( $attr_name ) ) { | ||
$attr_value = $node->getAttribute( $attr_name ); | ||
$urls_to_test = preg_split( '/\s*,\s*/', $attr_value ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since only srcset
has comma-separated list of URLs, maybe this should check of 'srcset' === $attr_name
for whether to split. Maybe that's overkill. My only concern would be if a non-srcset
URL happening to have a comma in it and that this would result in the attribute being invalid.
The alternative could be to do this:
$urls_to_test = preg_split( '#\s*,\s*(?=https?://)#', $attr_value );
That will ensure that it will only split where a comma is followed by another URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that check_attr_spec_rule_allowed_protocol
and check_attr_spec_rule_disallowed_relative
do simply comma-separated splits. So probably not something to worry about here then.
The tests may not be failing. The jobs are failing due to a PHPCS error:
|
* Fix phpcs yoda error. * Fix phpunit failures in AMP_Tag_And_Attribute_Sanitizer_Attr_Spec_Rules_Test.
'a_with_wrong_host' => array( | ||
'<a class="foo" href="http://foo bar">value</a>', | ||
'<a class="foo" href="">value</a>', | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should add a test here for scheme-less URLs, like: <a class="foo" href="//bad domain with a space.com/foo">value</a>
} | ||
|
||
// Check if the protocol contains invalid chars. | ||
$dots_pos = strpos( $url, ':' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this not be using $scheme = wp_parse_url( $url, PHP_URL_SCHEME )
to obtain the schema instead of ':'? What if a URL contains a port, then this could match in the case of a scheme-less URL like //example.com:80/foo/
or (more likely) the URL contains a colon in the path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, you could omit PHP_URL_HOST
param above and then just check for $parsed_url['scheme']
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that mail to:foo@bar.com
won't be correctly parsed by wp_parse_url, the scheme is null.
If the url contains a port number the code will check that no invalid chars are being used before the port number which will pass if the url is correct.
$urls_to_test = preg_split( '/\s*,\s*/', $node->getAttribute( $attr_name ) ); | ||
foreach ( $urls_to_test as $url ) { | ||
// Check if the host contains invalid chars. | ||
$url_host = wp_parse_url( urldecode( $url ), PHP_URL_HOST ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good use of urldecode()
here. If I have a URL like https://%65%78%61%6d%70%6c%65%2e%63%6f%6d/
then this should be valid in AMP, as it is just https://example.com/
👍
Fixes #977.