feat(utils): add sitemapFilter option to parseSitemap #3557

Merged
janbuchar merged 2 commits into apify:master from sbruinsje:feat/sitemap-filter
Apr 13, 2026

Conversation

@sbruinsje
Contributor

Motivation

When working with sitemap index files, parseSitemap currently follows all child sitemaps unconditionally. Sometimes sitemap indexes contain hundreds of child sitemaps, for instance, a child sitemap for every month going back 15 years (e.g., /articles-2010-01.xml through /articles-2026-03.xml). If you're only interested in the last 2 years of content, there's no way to skip the irrelevant ones without fetching and parsing all of them.

This PR adds a sitemapFilter callback option that lets you control which child sitemaps to skip, based on their URL.

Changes

  • Add optional sitemapFilter?: (sitemapUrl: string) => boolean to ParseSitemapOptions
  • When provided, each child sitemap URL discovered in a sitemap index is passed through the filter before being fetched. Return true to include, false to skip.
  • When not provided, behavior is unchanged: all nested sitemaps are followed.
  • Skipped sitemaps are logged at debug level.
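
The filtering rule described in this list can be sketched roughly as follows. This is only an illustration of the described behavior, not the code from this PR: `partitionChildSitemaps`, `PartitionResult`, and the `debug` parameter are hypothetical names, and the URLs are made up.

```typescript
// Hypothetical sketch of the described behavior; not the actual implementation.
type SitemapFilter = (sitemapUrl: string) => boolean;

interface PartitionResult {
    followed: string[];
    skipped: string[];
}

function partitionChildSitemaps(
    childSitemapUrls: string[],
    sitemapFilter?: SitemapFilter,
    debug: (message: string) => void = () => {},
): PartitionResult {
    const result: PartitionResult = { followed: [], skipped: [] };
    for (const url of childSitemapUrls) {
        // Without a filter, every child sitemap is followed (unchanged behavior).
        if (sitemapFilter === undefined || sitemapFilter(url)) {
            result.followed.push(url);
        } else {
            // Skipped sitemaps are reported through the debug callback.
            debug(`Skipping nested sitemap ${url}`);
            result.skipped.push(url);
        }
    }
    return result;
}

// Example child sitemap URLs (assumed for illustration).
const children = [
    'https://example.com/articles-2010-01.xml',
    'https://example.com/articles-2025-06.xml',
];
const debugMessages: string[] = [];
const withFilter = partitionChildSitemaps(
    children,
    (url) => url.includes('-2025-'),
    (message) => debugMessages.push(message),
);
const withoutFilter = partitionChildSitemaps(children);
```

Only child sitemap URLs taken from a sitemap index go through the callback in this sketch; page URLs would never reach it.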

Example usage

// Only follow child sitemaps from 2024 and 2025 (roughly the last 2 years)
for await (const url of parseSitemap(sources, undefined, {
    sitemapFilter: (url) => /articles-202[4-5]/.test(url),
})) {
    // ...
}
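
A regex with hard-coded years has to be updated as time passes. As an alternative, the filter could compare the year and month embedded in the URL against a cutoff. The sketch below assumes the `/articles-YYYY-MM.xml` naming scheme from the motivation section; `makeSinceFilter` is a hypothetical helper and not part of the library.

```typescript
// Hypothetical helper: builds a filter that keeps child sitemaps dated at or after a cutoff month.
// Assumes URLs end in "articles-YYYY-MM.xml", as in the motivation example.
function makeSinceFilter(cutoffYear: number, cutoffMonth: number): (sitemapUrl: string) => boolean {
    const pattern = /articles-(\d{4})-(\d{2})\.xml$/;
    return (sitemapUrl) => {
        const match = pattern.exec(sitemapUrl);
        // URLs that do not follow the naming scheme are kept, so nothing is dropped by accident.
        if (match === null) return true;
        const year = Number(match[1]);
        const month = Number(match[2]);
        return year > cutoffYear || (year === cutoffYear && month >= cutoffMonth);
    };
}

// Keep everything from April 2024 onwards.
const sinceApril2024 = makeSinceFilter(2024, 4);
```

The resulting function has the `(sitemapUrl: string) => boolean` shape described under Changes, so it could be passed as the filter option in place of the inline regex. Keeping unmatched URLs is a design choice in this sketch; returning `false` for them would be the stricter alternative.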

Add an optional `sitemapFilter` callback to `ParseSitemapOptions` that
allows filtering which nested sitemaps from sitemap index files are
followed. This is useful when a sitemap index contains many irrelevant
child sitemaps (e.g., video sitemaps) that should be skipped.

Made-with: Cursor
@janbuchar
Contributor

I haven't read the code yet, but do I understand correctly that this new callback is invoked for sitemapindex URLs, and that it's not used for actual page URLs?

@janbuchar janbuchar self-requested a review April 8, 2026 10:09
@sbruinsje
Contributor Author

Yes, that is correct. It is only invoked for child sitemap URLs in a sitemap index.

Contributor

@janbuchar janbuchar left a comment

LGTM, but let's think about the naming.

Comment thread packages/utils/src/internals/sitemap.ts Outdated
* Return `true` to include the sitemap, `false` to skip it.
* If not provided, all nested sitemaps are followed.
*/
sitemapFilter?: (sitemapUrl: string) => boolean;
Contributor

Let's think about alternative names for this option. How about nestedSitemapFilter?

Contributor Author

Yes, that makes sense. A nested sitemap filter is exactly what it is.

Contributor

Cool, let's go with that then

@sbruinsje sbruinsje requested a review from janbuchar April 11, 2026 19:34
@janbuchar janbuchar merged commit 1d4f6b9 into apify:master Apr 13, 2026
9 checks passed
3 participants