fix: escape XML special characters in sitemap generation#505
Conversation
sitemapToXml() interpolated user-supplied values (URLs, video titles, descriptions, uploader names) directly into XML without escaping. URLs containing & (extremely common with query params) produced invalid XML that search engines reject silently. Add escapeXml() helper that handles all five XML special characters (&, <, >, ", ') and apply it to every user-supplied value in the sitemap serializer. This intentionally diverges from Next.js's resolveSitemap() which also raw-interpolates, producing malformed XML.
commit: |
|
/bigbonk review |
There was a problem hiding this comment.
Solid fix. Unescaped & in sitemap URLs is a real-world problem that silently breaks XML parsing by search engines. The escapeXml() helper is correct (importantly, & is replaced first to avoid double-encoding), the coverage is thorough, and the intentional divergence from Next.js is well-documented in both code comments and the test name.
All 54 metadata-routes tests pass.
Two minor observations below — neither blocks merge.
| video.tag && `<video:tag>${escapeXml(String(video.tag))}</video:tag>`, | ||
| video.rating && `<video:rating>${video.rating}</video:rating>`, | ||
| video.expiration_date && | ||
| `<video:expiration_date>${video.expiration_date}</video:expiration_date>`, |
There was a problem hiding this comment.
Nit: expiration_date and publication_date are typed string | Date. When they're strings, they're interpolated raw here. In practice these will always be ISO date strings (no XML-special chars), but for consistency with the rest of the function, you could escapeXml(String(...)) them too. Not a real risk — just a completeness note.
Same applies to changeFrequency (line 270) and lastModified string values (line 267), though those are even less likely to contain special characters since changeFrequency is a union type and lastModified goes through serializeDate.
| expect(xml).toContain('<video:title>Fish & "Chips"</video:title>'); | ||
| expect(xml).toContain("<video:description>Tasty <b>meal</b></video:description>"); | ||
| expect(xml).toContain("<loc>https://example.com?a=1&b=2</loc>"); | ||
| expect(xml).toContain('href="https://example.com/fr?a=1&b=2"'); |
There was a problem hiding this comment.
Nice that the test uses 'fr"CA' as a hreflang key to cover quote escaping in attributes. Consider adding an assertion that the hreflang attribute value itself is escaped:
| expect(xml).toContain('href="https://example.com/fr?a=1&b=2"'); | |
| expect(xml).toContain('hreflang="fr"CA"'); | |
| expect(xml).toContain('href="https://example.com/fr?a=1&b=2"'); |
Currently only the href is asserted, so the quote-in-hreflang escaping is tested implicitly (by the well-formedness check at line 416) but not explicitly.
|
Review posted. Here's a summary: Approved the PR with two minor non-blocking observations:
The core fix is correct: |
Summary
sitemapToXml()interpolated user-supplied values directly into XML without escaping&(extremely common with query params like?q=a&b=2) produced invalid XML<,>,", or&also broke the XML structureFix
Add
escapeXml()helper that handles all five XML special characters (&<>"') and apply it to every user-supplied string value in the sitemap serializer: URLs, alternate hrefs, hreflang values, image locs, video titles/descriptions/tags/URLs, uploader names/info, restriction/platform attributes.Numeric fields (duration, view_count, rating) and controlled enums (family_friendly, live) are left unescaped since they can't contain XML-special characters.
This intentionally diverges from Next.js's
resolveSitemap()which also raw-interpolates. The existing parity test "matches Next's raw interpolation for XML-sensitive values" has been updated to assert correct XML escaping instead.Test plan
&produce&in<loc>,<image:loc>, andhrefattributes&,<,>,"are properly escaped