fix: escape XML special characters in sitemap generation #505

Merged
james-elicx merged 1 commit into cloudflare:main from NathanDrake2406:fix/sitemap-xml-escaping
Mar 12, 2026

@NathanDrake2406
Contributor

Summary

  • sitemapToXml() interpolated user-supplied values directly into XML without escaping
  • URLs with & (extremely common with query params like ?q=a&b=2) produced invalid XML
  • Video titles/descriptions with <, >, ", or & also broke the XML structure
  • Search engines silently reject malformed sitemaps, causing invisible SEO loss

Fix

Add escapeXml() helper that handles all five XML special characters (& < > " ') and apply it to every user-supplied string value in the sitemap serializer: URLs, alternate hrefs, hreflang values, image locs, video titles/descriptions/tags/URLs, uploader names/info, restriction/platform attributes.

Numeric fields (duration, view_count, rating) and controlled enums (family_friendly, live) are left unescaped since they can't contain XML-special characters.
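
As a rough illustration of the helper described above, the following is a minimal sketch and not the PR's actual code. The function body and the `loc` example are assumptions; only the name `escapeXml` and the five characters it covers come from the description.

```typescript
// Hypothetical sketch of an XML-escaping helper (not copied from the PR).
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;") // runs first so entities produced below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// A URL with query parameters, as in the Summary above.
const loc = `<loc>${escapeXml("https://example.com?a=1&b=2")}</loc>`;
console.log(loc); // <loc>https://example.com?a=1&amp;b=2</loc>
```

Numeric fields would bypass a helper like this, since a number converted to a string cannot contain any of the five characters.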

This intentionally diverges from Next.js's resolveSitemap() which also raw-interpolates. The existing parity test "matches Next's raw interpolation for XML-sensitive values" has been updated to assert correct XML escaping instead.

Test plan

  • New test: URLs with & produce &amp; in <loc>, <image:loc>, and href attributes
  • New test: video fields with &, <, >, " are properly escaped
  • Updated existing XML-sensitive values test to expect correct escaping
  • All 54 metadata-routes tests pass
  • Sitemap-related shim tests pass
  • CI: full Vitest + Playwright suite

sitemapToXml() interpolated user-supplied values (URLs, video titles,
descriptions, uploader names) directly into XML without escaping. URLs
containing & (extremely common with query params) produced invalid XML
that search engines reject silently.

Add escapeXml() helper that handles all five XML special characters
(&, <, >, ", ') and apply it to every user-supplied value in the
sitemap serializer.

This intentionally diverges from Next.js's resolveSitemap() which also
raw-interpolates, producing malformed XML.
@pkg-pr-new

pkg-pr-new Bot commented Mar 12, 2026

npm i https://pkg.pr.new/vinext@505

commit: d0873b7

@james-elicx
Collaborator

/bigbonk review

Contributor

@ask-bonk ask-bonk Bot left a comment

Solid fix. Unescaped & in sitemap URLs is a real-world problem that silently breaks XML parsing by search engines. The escapeXml() helper is correct (importantly, & is replaced first to avoid double-encoding), the coverage is thorough, and the intentional divergence from Next.js is well-documented in both code comments and the test name.
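
To see why the replacement order matters, here is a small sketch under assumptions: both function names are hypothetical and only two of the five characters are handled, to keep the contrast short. Escaping `&` last re-escapes the ampersand that the `<` replacement just introduced.

```typescript
// Deliberately wrong order, for illustration only: "&" is escaped last.
function escapeAmpLast(value: string): string {
  return value.replace(/</g, "&lt;").replace(/&/g, "&amp;");
}

// Order the review describes: "&" is escaped first.
function escapeAmpFirst(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

const input = "a<b";
const wrong = escapeAmpLast(input); // "a&amp;lt;b" (double-encoded)
const right = escapeAmpFirst(input); // "a&lt;b"
console.log({ wrong, right });
```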

All 54 metadata-routes tests pass.

Two minor observations below — neither blocks merge.

video.tag && `<video:tag>${escapeXml(String(video.tag))}</video:tag>`,
video.rating && `<video:rating>${video.rating}</video:rating>`,
video.expiration_date &&
`<video:expiration_date>${video.expiration_date}</video:expiration_date>`,

Nit: expiration_date and publication_date are typed string | Date. When they're strings, they're interpolated raw here. In practice these will always be ISO date strings (no XML-special chars), but for consistency with the rest of the function, you could escapeXml(String(...)) them too. Not a real risk — just a completeness note.

Same applies to changeFrequency (line 270) and lastModified string values (line 267), though those are even less likely to contain special characters since changeFrequency is a union type and lastModified goes through serializeDate.

expect(xml).toContain('<video:title>Fish & "Chips"</video:title>');
expect(xml).toContain("<video:description>Tasty <b>meal</b></video:description>");
expect(xml).toContain("<loc>https://example.com?a=1&amp;b=2</loc>");
expect(xml).toContain('href="https://example.com/fr?a=1&amp;b=2"');

Nice that the test uses 'fr"CA' as a hreflang key to cover quote escaping in attributes. Consider adding an assertion that the hreflang attribute value itself is escaped:

Suggested change
- expect(xml).toContain('href="https://example.com/fr?a=1&amp;b=2"');
+ expect(xml).toContain('hreflang="fr&quot;CA"');
+ expect(xml).toContain('href="https://example.com/fr?a=1&amp;b=2"');

Currently only the href is asserted, so the quote-in-hreflang escaping is tested implicitly (by the well-formedness check at line 416) but not explicitly.
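
The explicit check could look roughly like the sketch below. It is self-contained rather than taken from the test file: `alternateLink` is a hypothetical stand-in for the serializer's alternate-link output, the `xhtml:link` element shape is an assumption based on the usual sitemap convention for localized alternates, and `escapeXml` is re-declared here only so the snippet runs on its own.

```typescript
// Hypothetical helper, re-declared so this sketch is self-contained.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Assumed shape of one alternate-language link; both attributes are escaped.
function alternateLink(hreflang: string, href: string): string {
  return `<xhtml:link rel="alternate" hreflang="${escapeXml(hreflang)}" href="${escapeXml(href)}" />`;
}

const link = alternateLink('fr"CA', "https://example.com/fr?a=1&b=2");

// Check each attribute separately, so a regression in either one is visible.
console.log(link.includes('hreflang="fr&quot;CA"')); // true
console.log(link.includes('href="https://example.com/fr?a=1&amp;b=2"')); // true
```

Asserting on the attribute directly means a failure points at the hreflang escaping itself instead of surfacing only as a generic well-formedness error.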

@ask-bonk
Contributor

ask-bonk Bot commented Mar 12, 2026

Review posted. Here's a summary:

Approved the PR with two minor non-blocking observations:

  1. Date fields not escaped: expiration_date and publication_date are interpolated raw when they're strings. In practice these are always ISO date strings (no XML-special chars), but for consistency with the rest of the function they could be wrapped in escapeXml() too.

  2. Missing hreflang assertion: the test cleverly uses 'fr"CA' as a hreflang key to exercise quote escaping in attributes, but only asserts that the href value is escaped. Adding an explicit assertion on the hreflang attribute value would make the coverage more visible.

The core fix is correct: escapeXml() replaces & first (avoiding double-encoding), covers all five XML special characters, and is applied to every user-supplied string. The intentional divergence from Next.js's raw interpolation is well-justified and documented. All 54 tests pass.

@james-elicx james-elicx merged commit 7c425ef into cloudflare:main Mar 12, 2026
19 checks passed
@NathanDrake2406 NathanDrake2406 deleted the fix/sitemap-xml-escaping branch March 18, 2026 09:51