@marevol marevol commented Jan 24, 2026

Summary

  • Deduplicate anchor URLs extracted by FessXpathTransformer#getAnchorList using a LinkedHashSet, preventing the crawler from processing the same URL multiple times when it appears across multiple anchor tags on the same page.

Changes Made

  • Replaced ArrayList with LinkedHashSet in getAnchorList to collect unique URLs while preserving first-occurrence order.
  • Added comprehensive unit tests covering:
    • No duplicates (baseline behavior)
    • Duplicates from the same tag type (<a>)
    • Duplicates from different tag types (<a>, <img>, <link>)
    • Insertion order preservation
    • Empty documents
    • All-duplicate scenarios
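The core of the change can be sketched as follows. Note this is a simplified stand-in: the real `FessXpathTransformer#getAnchorList` walks a parsed DOM rather than taking a list of strings, so the signature and class name below are assumptions for illustration only:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AnchorDedup {

    // Hypothetical stand-in for FessXpathTransformer#getAnchorList:
    // collecting into a LinkedHashSet drops duplicate URLs while
    // keeping the first-occurrence order of the remaining entries.
    static List<String> getAnchorList(final List<String> extractedUrls) {
        final Set<String> urlSet = new LinkedHashSet<>(extractedUrls);
        return new ArrayList<>(urlSet);
    }

    public static void main(final String[] args) {
        final List<String> anchors = Arrays.asList(
                "https://example.com/a",    // from an <a> tag
                "https://example.com/logo", // from an <img> tag
                "https://example.com/a",    // duplicate from another <a>
                "https://example.com/logo"  // duplicate from a <link>
        );
        System.out.println(getAnchorList(anchors));
        // prints [https://example.com/a, https://example.com/logo]
    }
}
```

A LinkedHashSet is backed by a hash table plus a doubly linked list, so membership checks stay O(1) while iteration follows insertion order, which is why it fits this use case better than sorting or a plain HashSet.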

Testing

  • Added 7 new unit tests in FessXpathTransformerTest validating deduplication behavior across various scenarios.
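The order-preservation and all-duplicate scenarios above can be exercised with a minimal sketch (this is not the actual FessXpathTransformerTest, which drives the transformer against parsed documents; the assertions here only demonstrate the LinkedHashSet property the tests rely on):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupOrderSketch {
    public static void main(final String[] args) {
        // Insertion order preservation: first occurrences keep their position.
        final List<String> input = Arrays.asList("b", "a", "b", "c", "a");
        final List<String> result = new ArrayList<>(new LinkedHashSet<>(input));
        System.out.println(result); // prints [b, a, c]

        // All-duplicate scenario collapses to a single entry.
        final List<String> dupes = Arrays.asList("x", "x", "x");
        System.out.println(new LinkedHashSet<>(dupes).size()); // prints 1
    }
}
```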

🤖 Generated with Claude Code

Use LinkedHashSet instead of ArrayList in getAnchorList to eliminate
duplicate URLs while preserving insertion order. This prevents the
crawler from processing the same URL multiple times when it appears
in different anchor tags on the same page.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@marevol marevol self-assigned this Jan 24, 2026
@marevol marevol added this to the 15.5.0 milestone Jan 24, 2026
@marevol marevol merged commit 8b8bc7c into master Jan 24, 2026
1 check passed
