fix(sitemap): Prevent sitemap parser from leaking metadata between <url> entries#1992
Merged
Merged
Conversation
A `<url>` block carrying `lastmod`/`priority`/`changefreq` children but no `<loc>` left `_XMLSaxSitemapHandler._current_url` uncleared on `</url>`, because both the emit and the reset were gated on `'loc' in self._current_url`. The stale metadata then leaked onto the next parsed `<url>`. Reset `_current_url` on every closing `</url>` while still only emitting an item when a `loc` is present.
2a2851e to
25c31b6
Compare
<url> entries without <loc>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The XML sitemap parser leaks per-entry metadata from a malformed
<url>block into the next entry._XMLSaxSitemapHandler.endElement(src/crawlee/_utils/sitemap.py), the emit and the reset ofself._current_urlwere both gated behindif name == 'url' and 'loc' in self._current_url:._current_urlaccumulateslastmod/priority/changefreqas those child tags close. When a<url>block has metadata children but no (or an empty)<loc>, the closing</url>did not reset_current_url, so its stale metadata carried over and attached to the next parsed<url>._current_urlon every closing</url>, while still only emitting an item when alocis present. This discards a locless block's metadata instead of leaking it, and leaves well-formed sitemaps unchanged.Issues
Testing
test_xml_handler_discards_metadata_from_url_without_loctotests/unit/_utils/test_sitemap.py. It drives the realExpatParser+_XMLSaxSitemapHandlerpath (the same path_XmlSitemapParseruses in production), feeds a locless metadata<url>followed by a real<url><loc>...</loc></url>, and asserts the output is a single clean item with no leakedlastmod/priority/changefreq.master(the metadata leaks) and passes with the fix.uv run pytest tests/unit/_utils/test_sitemap.py tests/unit/request_loaders/test_sitemap_request_loader.py-> 101 passed. The request-loader suite exercises the same parser path.uv run poe lint(ruff check + format) anduv run poe type-check(ty) both pass on the changed files.Checklist