fix url-encoding in sitemap [see #8848] #8849

Closed
wants to merge 1 commit into
base: 3.5
from

3 participants
@binfalse
Contributor

binfalse commented Feb 7, 2018

to create the sitemap all contao-urls were rawurlencoded:

$strUrl = rawurlencode($strUrl);

afterwards, some special characters were restored (for example slashes and question marks):

$strUrl = str_replace(array('%2F', '%3F', '%3D', '%26', '%3A//'), array('/', '?', '=', '&', '://'), $strUrl);

however, colons were only restored as part of ://
thus, if the Contao website runs on a non-standard port (or behind a reverse proxy etc.), all urls include the port number as :PORT, which rawurlencode turns into %3APORT, which in turn makes the sitemap urls invalid.
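A minimal sketch of the problem described above, assuming a made-up host and port. The function name `legacySitemapEncode()` is hypothetical and only wraps the two statements quoted in this comment:

```php
<?php
// Hypothetical wrapper around the two statements quoted above; not a Contao function.
function legacySitemapEncode($strUrl)
{
    // Step 1: percent-encode everything except unreserved characters.
    $strUrl = rawurlencode($strUrl);

    // Step 2: restore a fixed set of sequences. A colon is only restored when followed by "//".
    return str_replace(
        array('%2F', '%3F', '%3D', '%26', '%3A//'),
        array('/', '?', '=', '&', '://'),
        $strUrl
    );
}

// example.com:8080 is an assumed address used for illustration only.
echo legacySitemapEncode('http://example.com:8080/news/item.html?page=2'), "\n";
// prints: http://example.com%3A8080/news/item.html?page=2
```

The scheme separator survives because `%3A//` is restored, while the colon in front of the port stays encoded as `%3A`.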

i guess the urls need to be treated specially to obtain valid XML documents?

here i propose to skip the rawurlencode call and the subsequent restoring of certain characters, and instead make the urls XML-safe using:

htmlspecialchars ($string, ENT_XML1, 'UTF-8');

see #8848

@leofeyer

Member

leofeyer commented Feb 8, 2018

@ausi /cc

@ausi

Member

ausi commented Feb 8, 2018

i guess the urls need to be treated specially to obtain valid XML documents?

No, they need to be URL-encoded, see contao/core-bundle#1095 (comment)

@ausi

Member

ausi commented Feb 8, 2018

@binfalse can you please check whether the following code works for your use case?

$strUrl = explode('/', $strUrl, 4);
$strUrl[3] = rawurlencode($strUrl[3]);
$strUrl[3] = str_replace(array('%2F', '%3F', '%3D', '%26', '%25'), array('/', '?', '=', '&', '%'), $strUrl[3]);
$strUrl = implode('/', $strUrl);
$strUrl = ampersand($strUrl, true);
@binfalse

Contributor

binfalse commented Feb 8, 2018

@ausi the snippet seems to be working fine (seems to produce the same result as htmlspecialchars ($string, ENT_XML1, 'UTF-8'); for me...?)

@leofeyer leofeyer added the defect label Feb 12, 2018

@leofeyer leofeyer added this to the 3.5.33 milestone Feb 12, 2018

@leofeyer

Member

leofeyer commented Feb 12, 2018

Here is the difference between our current code and htmlspecialchars($string, ENT_XML1, 'UTF-8'):

https://3v4l.org/9FfDi

@ausi Is it correct that [] do not need to be encoded? Then we could use htmlspecialchars here.

@leofeyer

Member

leofeyer commented Feb 12, 2018

Ok, so as briefly discussed with @ausi, using htmlspecialchars() is not a good solution, because it does not percent-encode Unicode characters. I have updated my test case accordingly:

https://3v4l.org/g7cZJ
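To illustrate the difference, here is a small sketch (it does not reproduce the linked test case). The URL is made up, and the "ü" is written as its UTF-8 bytes so that the example does not depend on the file encoding:

```php
<?php
// Hypothetical URL whose path contains "ü" (UTF-8 bytes C3 BC) and whose query contains "&".
$strUrl = "http://example.com/s\xC3\xBCd?a=1&b=2";

// htmlspecialchars() escapes XML-special characters only; the "ü" bytes pass through unchanged.
$strXmlEscaped = htmlspecialchars($strUrl, ENT_XML1, 'UTF-8');

// rawurlencode() percent-encodes each byte of the non-ASCII character.
$strSegment = rawurlencode("s\xC3\xBCd");

echo $strXmlEscaped, "\n"; // the "&" becomes "&amp;", the "ü" is still a raw character
echo $strSegment, "\n";    // prints: s%C3%BCd
```

So the XML escaping alone yields a well-formed document, but the URL inside it still contains characters that are not URL-encoded.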

So we should go for @ausi's solution (with an additional isset($strUrl[3]) check in case there is neither path nor query and no trailing slash).

And we should probably decode %5B and %5D as [ and ].
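A possible shape of that approach, as a sketch only and not necessarily what the final commit contains. `encodeSitemapUrl()` is a hypothetical name, and the plain `str_replace()` at the end is a simplified stand-in for Contao's `ampersand($strUrl, true)`:

```php
<?php
// Hypothetical helper; combines the snippet above with the isset() check and [ ] decoding.
function encodeSitemapUrl($strUrl)
{
    // Split into "http:", "", host[:port] and everything after the host (path plus query).
    $arrChunks = explode('/', $strUrl, 4);

    // A URL such as "http://example.com" has no fourth chunk, so there is nothing to encode.
    if (isset($arrChunks[3])) {
        $arrChunks[3] = rawurlencode($arrChunks[3]);
        $arrChunks[3] = str_replace(
            array('%2F', '%3F', '%3D', '%26', '%25', '%5B', '%5D'),
            array('/', '?', '=', '&', '%', '[', ']'),
            $arrChunks[3]
        );
    }

    $strUrl = implode('/', $arrChunks);

    // Simplified stand-in for ampersand($strUrl, true): escape "&" for the XML output.
    return str_replace('&', '&amp;', $strUrl);
}

// Assumed example URL with a port, a non-ASCII character, a space and brackets.
echo encodeSitemapUrl("http://example.com:8080/s\xC3\xBCd/a b.html?x[]=1&y=2"), "\n";
// prints: http://example.com:8080/s%C3%BCd/a%20b.html?x[]=1&amp;y=2
```

Because the scheme and host are never passed through `rawurlencode()`, the port colon is left alone, while the path and query are still URL-encoded.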

@leofeyer leofeyer modified the milestones: 3.5.33, 3.5.34 Feb 15, 2018

@leofeyer leofeyer self-assigned this Feb 16, 2018

leofeyer added a commit that referenced this pull request Feb 16, 2018

@leofeyer

Member

leofeyer commented Feb 16, 2018

Fixed in 36be049.

@leofeyer leofeyer closed this Feb 16, 2018

leofeyer added a commit to contao/core-bundle that referenced this pull request Feb 16, 2018
