fix(sitemap): Prevent sitemap parser from leaking metadata between <url> entries #1992

Merged
vdusek merged 1 commit into apify:master from anxkhn:loop/crawlee-python__001
Jun 30, 2026

Conversation

@anxkhn anxkhn commented Jun 28, 2026

Contributor

Description

The XML sitemap parser leaks per-entry metadata from a malformed <url> block into the next entry.

  • In _XMLSaxSitemapHandler.endElement (src/crawlee/_utils/sitemap.py), the emit and the reset of self._current_url were both gated behind if name == 'url' and 'loc' in self._current_url:.
  • _current_url accumulates lastmod / priority / changefreq as those child tags close. When a <url> block has metadata children but no (or an empty) <loc>, the closing </url> did not reset _current_url, so its stale metadata carried over and attached to the next parsed <url>.
  • Fix: reset _current_url on every closing </url>, while still only emitting an item when a loc is present. This discards a locless block's metadata instead of leaking it, and leaves well-formed sitemaps unchanged.
-        if name == 'url' and 'loc' in self._current_url:
-            self.items.append({'type': 'url', **self._current_url})
+        if name == 'url':
+            if 'loc' in self._current_url:
+                self.items.append({'type': 'url', **self._current_url})
             self._current_url = {}
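
The leak and the fix described above can be reproduced with a small standalone sketch. `SketchSitemapHandler` and its `reset_always` flag are hypothetical stand-ins written for illustration; they are not the crawlee `_XMLSaxSitemapHandler`, and the text buffering is assumed rather than copied from the real class. Only the `</url>` branch mirrors the diff.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class SketchSitemapHandler(ContentHandler):
    """Simplified, hypothetical stand-in for the real sitemap handler."""

    METADATA_TAGS = ('loc', 'lastmod', 'priority', 'changefreq')

    def __init__(self, *, reset_always: bool) -> None:
        super().__init__()
        # True models the fixed behavior, False models the old gating.
        self.reset_always = reset_always
        self.items: list[dict[str, str]] = []
        self._current_url: dict[str, str] = {}
        self._buffer = ''

    def startElement(self, name, attrs):
        self._buffer = ''

    def characters(self, content):
        self._buffer += content

    def endElement(self, name):
        text = self._buffer.strip()
        # Child tags accumulate into the per-entry dict as they close.
        if name in self.METADATA_TAGS and text:
            self._current_url[name] = text
        if name == 'url':
            has_loc = 'loc' in self._current_url
            if has_loc:
                self.items.append({'type': 'url', **self._current_url})
            # Old behavior: reset only when a loc was present.
            # Fixed behavior: reset on every closing </url>.
            if has_loc or self.reset_always:
                self._current_url = {}
        self._buffer = ''


SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><lastmod>2020-01-01</lastmod><priority>0.9</priority></url>
  <url><loc>https://example.com/page</loc></url>
</urlset>"""


def parse(*, reset_always: bool) -> list[dict[str, str]]:
    handler = SketchSitemapHandler(reset_always=reset_always)
    xml.sax.parseString(SITEMAP, handler)
    return handler.items


leaky = parse(reset_always=False)   # the second entry inherits lastmod and priority
clean = parse(reset_always=True)    # the second entry carries only its own loc
```

With `reset_always=False` the single emitted item contains the `lastmod` and `priority` of the locless block; with `reset_always=True` it contains only `type` and `loc`.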

Issues

  • No existing GitHub issue. This is a self-identified correctness bug found while reviewing the sitemap parser. Happy to open a tracking issue with the repro first if you'd prefer that.

Testing

  • Added test_xml_handler_discards_metadata_from_url_without_loc to tests/unit/_utils/test_sitemap.py. It drives the real ExpatParser + _XMLSaxSitemapHandler path (the same path _XmlSitemapParser uses in production), feeds a locless metadata <url> followed by a real <url><loc>...</loc></url>, and asserts the output is a single clean item with no leaked lastmod / priority / changefreq.
  • Verified fail-first: the new test fails on master (the metadata leaks) and passes with the fix.
  • uv run pytest tests/unit/_utils/test_sitemap.py tests/unit/request_loaders/test_sitemap_request_loader.py -> 101 passed. The request-loader suite exercises the same parser path.
  • uv run poe lint (ruff check + format) and uv run poe type-check (ty) both pass on the changed files.
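
A regression test along the lines described above might look roughly like the following sketch. It is not the test added in this PR: `StandInHandler` and `parse_in_chunks` are hypothetical names, the handler is a simplified substitute for `_XMLSaxSitemapHandler` so that the example runs without crawlee installed, and feeding the document in chunks through `ExpatParser.feed` is an assumption about how the parser is driven.

```python
from xml.sax.expatreader import ExpatParser
from xml.sax.handler import ContentHandler


class StandInHandler(ContentHandler):
    """Hypothetical substitute for the real handler, with the fixed reset logic."""

    def __init__(self) -> None:
        super().__init__()
        self.items: list[dict[str, str]] = []
        self._current_url: dict[str, str] = {}
        self._buffer = ''

    def startElement(self, name, attrs):
        self._buffer = ''

    def characters(self, content):
        self._buffer += content

    def endElement(self, name):
        text = self._buffer.strip()
        if name in ('loc', 'lastmod', 'priority', 'changefreq') and text:
            self._current_url[name] = text
        if name == 'url':
            if 'loc' in self._current_url:
                self.items.append({'type': 'url', **self._current_url})
            # Reset unconditionally so a locless block cannot leak metadata.
            self._current_url = {}
        self._buffer = ''


def parse_in_chunks(chunks: list[str]) -> list[dict[str, str]]:
    handler = StandInHandler()
    parser = ExpatParser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)  # incremental parsing, one chunk at a time
    parser.close()
    return handler.items


def test_discards_metadata_from_url_without_loc() -> None:
    items = parse_in_chunks([
        # First block: metadata children and an empty <loc>.
        '<urlset><url><lastmod>2020-01-01</lastmod>'
        '<changefreq>daily</changefreq><loc></loc></url>',
        # Second block: a well-formed entry, split across two chunks.
        '<url><loc>https://example.com/page</loc>',
        '</url></urlset>',
    ])
    assert items == [{'type': 'url', 'loc': 'https://example.com/page'}]
```

The empty `<loc></loc>` case is included because the description treats an empty `<loc>` the same as a missing one.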

Checklist

  • CI passed

@vdusek vdusek requested a review from Mantisus June 29, 2026 07:51

@Mantisus Mantisus left a comment

Collaborator


Hey, @anxkhn. Thank you for your contribution.

Just one small thing.

Comment thread tests/unit/_utils/test_sitemap.py Outdated
A `<url>` block carrying `lastmod`/`priority`/`changefreq` children but no
`<loc>` left `_XMLSaxSitemapHandler._current_url` uncleared on `</url>`, because
both the emit and the reset were gated on `'loc' in self._current_url`. The stale
metadata then leaked onto the next parsed `<url>`. Reset `_current_url` on every
closing `</url>` while still only emitting an item when a `loc` is present.
@anxkhn anxkhn force-pushed the loop/crawlee-python__001 branch from 2a2851e to 25c31b6 Compare June 29, 2026 17:53
@anxkhn anxkhn requested a review from Mantisus June 29, 2026 17:57

@Mantisus Mantisus left a comment

Collaborator


LGTM

@vdusek vdusek changed the title from "fix: Discard metadata from sitemap <url> entries without <loc>" to "fix: Stop sitemap metadata from leaking onto the wrong <url> entry" Jun 30, 2026
@vdusek vdusek changed the title from "fix: Stop sitemap metadata from leaking onto the wrong <url> entry" to "fix: Prevent sitemap parser from leaking metadata between <url> entries" Jun 30, 2026

@vdusek vdusek left a comment

Collaborator


LGTM

@vdusek vdusek changed the title from "fix: Prevent sitemap parser from leaking metadata between <url> entries" to "fix(sitemap): Prevent sitemap parser from leaking metadata between <url> entries" Jun 30, 2026
@vdusek vdusek merged commit a58687e into apify:master Jun 30, 2026
35 checks passed

4 participants