generateSitemap does not properly handle port-components in URLs #8848

Closed
binfalse opened this Issue Feb 7, 2018 · 5 comments

@binfalse
Contributor

binfalse commented Feb 7, 2018

When URLs in Contao require a port (such as https://example.com:1234, e.g. when running on a non-standard port), the URLs in the sitemap become invalid.
The Automator sends every single URL through rawurlencode, see

$strUrl = rawurlencode($strUrl);
(probably to prevent invalid XML?)
But rawurlencode converts much more than is necessary for XML documents. Therefore a few characters are restored afterwards, see
$strUrl = str_replace(array('%2F', '%3F', '%3D', '%26', '%3A//'), array('/', '?', '=', '&', '://'), $strUrl);

However, colons are only restored in combination with ://, so a port component turns into something like https://example.com%3A1234

As a result, the sitemap contains lots of invalid URLs.
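
For illustration, a minimal sketch that reproduces the behaviour described above. The function name `encodeLikeAutomator` and the sample URL are hypothetical; only the two statements inside the function are taken from the Automator lines quoted in this report.

```php
<?php
// Hypothetical wrapper around the two quoted Automator statements.
function encodeLikeAutomator(string $strUrl): string
{
    // Step 1: percent-encode everything except unreserved characters.
    $strUrl = rawurlencode($strUrl);

    // Step 2: restore a fixed set of characters. A colon only comes back
    // when it is followed by "//", so the port separator stays encoded.
    return str_replace(
        array('%2F', '%3F', '%3D', '%26', '%3A//'),
        array('/', '?', '=', '&', '://'),
        $strUrl
    );
}

// Sample URL with an explicit port (an assumption for this sketch).
echo encodeLikeAutomator('https://example.com:1234/news/item.html?page=2'), PHP_EOL;
// prints: https://example.com%3A1234/news/item.html?page=2
```

The scheme separator survives only because the slashes are restored before the `%3A//` replacement runs; the colon in front of the port has no such pattern to match.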

binfalse added a commit to binfalse/contao-core that referenced this issue Feb 7, 2018

fix url-encoding in sitemap [see contao#8848]
to create the sitemap all contao-urls were `rawurlencode`d:
https://github.com/contao/core/blob/c7f0310ebd3f4e8b32a82f10f9ffa6827ab4b2a3/system/modules/core/library/Contao/Automator.php#L414

afterwards, some special characters were recovered (for example slashes and question marks):
https://github.com/contao/core/blob/c7f0310ebd3f4e8b32a82f10f9ffa6827ab4b2a3/system/modules/core/library/Contao/Automator.php#L415

however, colons were only recovered in combination with `://`...
thus, if the contao website is running at an unusual port (or behind a reverse proxy etc) all urls may include the port number, such as `:PORT`, which becomes `%3APORT` via `rawurlencode`, which in turn renders the sitemap-urls invalid...

i guess the urls need to be treated specially to obtain valid XML documents?

here i propose to skip the `rawurlencode` plus subsequent recovering of certain entities, and instead encode the urls XML-safe using:

    htmlspecialchars ($string, ENT_XML1, 'UTF-8');

see contao#8848
@binfalse

Contributor

binfalse commented Feb 7, 2018

PR #8849 uses htmlspecialchars instead of rawurlencode to create XML-safe versions of the URLs, which should fix the problem.
(However, I'm not sure if anything else is relying on rawurlencode-like entities somewhere?)
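
A minimal sketch of what the htmlspecialchars approach does to a URL, assuming it is applied to the complete URL string. The helper name `xmlEscapeUrl` and the sample URL are hypothetical; the call and its arguments are the ones from the commit message.

```php
<?php
// Hypothetical helper: XML-escape a URL without percent-encoding it.
function xmlEscapeUrl(string $strUrl): string
{
    // With ENT_XML1 and no quote flag, only &, < and > are replaced.
    return htmlspecialchars($strUrl, ENT_XML1, 'UTF-8');
}

echo xmlEscapeUrl('https://example.com:1234/list.html?page=2&sort=asc'), PHP_EOL;
// prints: https://example.com:1234/list.html?page=2&amp;sort=asc
```

The port colon is left alone and the ampersand becomes `&amp;`. Note that this call does not percent-encode anything, so spaces or non-ASCII characters in a URL would pass through unchanged; whether that matters depends on how the URLs are generated upstream.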

@aschempp

Contributor

aschempp commented Feb 8, 2018

@dmolineus

Contributor

dmolineus commented Feb 8, 2018

The requirements for url escaping are described here: https://www.sitemaps.org/protocol.html

@fritzmg

Contributor

fritzmg commented Feb 8, 2018

Probably only the path and query should be encoded, not the whole URI including the host and port.
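
One way this could look, as a rough sketch rather than the fix that was merged: split the URL with `parse_url`, percent-encode only the path segments and query values, and XML-escape the result. The names `encodeSitemapUrl` and `encodePart` are hypothetical, and the sketch assumes absolute URLs and drops any fragment.

```php
<?php
// Hypothetical helper: decode first so that already-encoded input is not
// encoded twice, then percent-encode again.
function encodePart(string $value, bool $isQuery): string
{
    // In a query, "+" conventionally means a space, so urldecode() is used there.
    $decoded = $isQuery ? urldecode($value) : rawurldecode($value);

    return rawurlencode($decoded);
}

// Hypothetical helper: encode path and query only, keep scheme, host and port.
function encodeSitemapUrl(string $strUrl): string
{
    $parts = parse_url($strUrl);

    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException('Expected an absolute URL: ' . $strUrl);
    }

    // Scheme, host and port are copied verbatim.
    $result = $parts['scheme'] . '://' . $parts['host'];

    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }

    if (isset($parts['path'])) {
        // Encode each segment separately so the slashes stay literal.
        $segments = array();
        foreach (explode('/', $parts['path']) as $segment) {
            $segments[] = encodePart($segment, false);
        }
        $result .= implode('/', $segments);
    }

    if (isset($parts['query'])) {
        // Encode keys and values, keep "&" and the first "=" of each pair.
        $pairs = array();
        foreach (explode('&', $parts['query']) as $pair) {
            $keyValue = array();
            foreach (explode('=', $pair, 2) as $item) {
                $keyValue[] = encodePart($item, true);
            }
            $pairs[] = implode('=', $keyValue);
        }
        $result .= '?' . implode('&', $pairs);
    }

    // Finally make the URL safe for XML element content.
    return htmlspecialchars($result, ENT_XML1, 'UTF-8');
}

echo encodeSitemapUrl('https://example.com:1234/news/my page.html?tag=a&b=c d'), PHP_EOL;
// prints: https://example.com:1234/news/my%20page.html?tag=a&amp;b=c%20d
```

Compared with escaping the whole string, this keeps `:1234` intact while still producing percent-encoded paths and entity-escaped ampersands.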

@leofeyer

Member

leofeyer commented Feb 12, 2018

See #8849.

@leofeyer leofeyer closed this Feb 12, 2018
