fix: Reset private state correctly in sitemap parsers #1938

Merged
vdusek merged 2 commits into master from fix/sitemap-parser-attribute-typos
Jun 4, 2026

@vdusek
Collaborator

@vdusek vdusek commented Jun 3, 2026

Description

Fixes two attribute-name typos in src/crawlee/_utils/sitemap.py where the underscore prefix was missing on assignment, so each statement silently created a new public attribute instead of resetting the intended private state:

  • _XMLSaxSitemapHandler.endElement set self.current_tag = None instead of self._current_tag = None. The handler therefore never left the "inside a tracked tag" state, so stray text between elements kept being appended to the buffer and a duplicate close tag could re-process stale buffer contents.
  • _TxtSitemapParser.flush set self.buffer = '' instead of self._buffer = ''. Reusing the parser after flush() concatenated the leftover URL with the next chunk, yielding corrupted URLs like https://b.com/https://c.com/.

Both fixes add the missing underscore. Regression tests cover the state reset in the XML handler and the buffer corruption in the TXT parser.
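For readers who want to see the TXT failure mode in isolation, here is a minimal sketch. `TxtParserSketch`, its `fixed` flag, and `reuse_after_flush` are hypothetical names invented for illustration; this is a simplified line-buffering parser, not the actual `_TxtSitemapParser` code.

```python
class TxtParserSketch:
    """Hypothetical line-buffering parser, not the crawlee implementation."""

    def __init__(self, *, fixed: bool) -> None:
        self._buffer = ''
        self._fixed = fixed

    def process_chunk(self, chunk: str) -> list[str]:
        # Keep the trailing partial line in the buffer, emit the complete ones.
        self._buffer += chunk
        *lines, self._buffer = self._buffer.split('\n')
        return [line.strip() for line in lines if line.strip()]

    def flush(self) -> list[str]:
        # Emit whatever is left in the buffer as the final URL.
        urls = [self._buffer.strip()] if self._buffer.strip() else []
        if self._fixed:
            self._buffer = ''  # resets the private state as intended
        else:
            self.buffer = ''  # typo: creates a new public attribute instead
        return urls


def reuse_after_flush(*, fixed: bool) -> list[str]:
    """Feed one chunk, flush, then reuse the same parser for another chunk."""
    parser = TxtParserSketch(fixed=fixed)
    parser.process_chunk('https://a.com/\nhttps://b.com/')
    parser.flush()
    return parser.process_chunk('https://c.com/\n')


print(reuse_after_flush(fixed=False))  # ['https://b.com/https://c.com/']
print(reuse_after_flush(fixed=True))   # ['https://c.com/']
```

In the buggy variant the leftover `https://b.com/` is still in `_buffer` after `flush()`, so the next chunk is concatenated onto it.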

The XML SAX handler assigned `self.current_tag` instead of `self._current_tag`
and the TXT parser assigned `self.buffer` instead of `self._buffer`, so the
intended private state was never reset. This left stray inter-element text in
the XML buffer and leaked the last URL into subsequent chunks when a TXT
parser was reused after `flush()`.
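The XML side can be sketched the same way with the standard library's `xml.sax`. `LocHandlerSketch`, the `fixed` flag, `parse_sample`, and the sample document are assumptions made up for this illustration; the handler only tracks `loc` and is far simpler than the real `_XMLSaxSitemapHandler`.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample: "stray" is text sitting between two elements.
SAMPLE = b'<urlset><url><loc>https://a.com/</loc>stray</url></urlset>'


class LocHandlerSketch(ContentHandler):
    """Hypothetical handler that only tracks <loc>; not the crawlee code."""

    def __init__(self, *, fixed: bool) -> None:
        super().__init__()
        self._fixed = fixed
        self._current_tag = None
        self._buffer = ''
        self.urls = []

    def startElement(self, name, attrs):
        if name == 'loc':
            self._current_tag = name
            self._buffer = ''

    def characters(self, content):
        # Text is collected only while inside a tracked tag.
        if self._current_tag:
            self._buffer += content

    def endElement(self, name):
        if name == self._current_tag:
            self.urls.append(self._buffer.strip())
            if self._fixed:
                self._current_tag = None  # leaves the tracked-tag state
            else:
                self.current_tag = None  # typo: private state is untouched


def parse_sample(*, fixed: bool) -> LocHandlerSketch:
    handler = LocHandlerSketch(fixed=fixed)
    xml.sax.parseString(SAMPLE, handler)
    return handler


buggy = parse_sample(fixed=False)
print(buggy._current_tag, repr(buggy._buffer))  # loc 'https://a.com/stray'

fixed = parse_sample(fixed=True)
print(fixed._current_tag, repr(fixed._buffer))  # None 'https://a.com/'
```

With the typo, the handler still believes it is inside `loc` after the closing tag, so the stray text ends up in the buffer.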
@vdusek vdusek added the t-tooling (Issues with this label are in the ownership of the tooling team.) and adhoc (Ad-hoc unplanned task added during the sprint.) labels Jun 3, 2026
@vdusek vdusek self-assigned this Jun 3, 2026
@github-actions github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 3, 2026
@github-actions github-actions Bot added the tested (Temporary label used only programmatically for some analytics.) label Jun 3, 2026
@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.97%. Comparing base (8ab3f95) to head (b3ec968).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1938      +/-   ##
==========================================
+ Coverage   92.95%   92.97%   +0.01%     
==========================================
  Files         167      167              
  Lines       11708    11708              
==========================================
+ Hits        10883    10885       +2     
+ Misses        825      823       -2     
Flag Coverage Δ
unit 92.97% <100.00%> (+0.01%) ⬆️

@vdusek vdusek requested a review from Mantisus June 3, 2026 17:57
@vdusek vdusek marked this pull request as ready for review June 3, 2026 17:57
Collaborator

@Mantisus Mantisus left a comment

Good catch!

@vdusek
Collaborator Author

vdusek commented Jun 4, 2026

@copilot resolve the merge conflicts in this pull request

Contributor

Copilot AI commented Jun 4, 2026

@copilot resolve the merge conflicts in this pull request

Resolved the merge conflict and pushed the merge commit in b3ec968.

Copilot finished work on behalf of vdusek June 4, 2026 07:43
@vdusek vdusek merged commit 7db517a into master Jun 4, 2026
32 of 33 checks passed
@vdusek vdusek deleted the fix/sitemap-parser-attribute-typos branch June 4, 2026 07:53
Labels

  • adhoc: Ad-hoc unplanned task added during the sprint.
  • t-tooling: Issues with this label are in the ownership of the tooling team.
  • tested: Temporary label used only programmatically for some analytics.

4 participants