diff --git a/docs/guides/architecture_overview.mdx b/docs/guides/architecture_overview.mdx
index 81243174b8..0f1b235b60 100644
--- a/docs/guides/architecture_overview.mdx
+++ b/docs/guides/architecture_overview.mdx
@@ -291,7 +291,7 @@ Request loaders provide a subset of `RequestQue
 
 - `RequestLoader` - Base interface for read-only access to a stream of requests, with capabilities like fetching the next request, marking as handled, and status checking.
 - `RequestList` - Lightweight in-memory implementation of `RequestLoader` for managing static lists of URLs.
-- `SitemapRequestLoader` - Specialized loader for reading URLs from XML sitemaps with filtering capabilities.
+- `SitemapRequestLoader` - A specialized loader that reads URLs from XML and plain-text sitemaps following the [Sitemaps protocol](https://www.sitemaps.org/protocol.html) with filtering capabilities.
 
 ### Request managers
diff --git a/docs/guides/request_loaders.mdx b/docs/guides/request_loaders.mdx
index bfef65a411..2c5607c8ff 100644
--- a/docs/guides/request_loaders.mdx
+++ b/docs/guides/request_loaders.mdx
@@ -31,7 +31,7 @@ The [`request_loaders`](https://github.com/apify/crawlee-python/tree/master/src/
 And specific request loader implementations:
 
 - `RequestList`: A lightweight implementation for managing a static list of URLs.
-- `SitemapRequestLoader`: A specialized loader that reads URLs from XML sitemaps with filtering capabilities.
+- `SitemapRequestLoader`: A specialized loader that reads URLs from XML and plain-text sitemaps following the [Sitemaps protocol](https://www.sitemaps.org/protocol.html) with filtering capabilities.
 
 Below is a class diagram that illustrates the relationships between these components and the `RequestQueue`:
@@ -130,7 +130,13 @@ To enable persistence, provide `persist_state_key` and optionally `persist_reque
 
 ### Sitemap request loader
 
-The `SitemapRequestLoader` is a specialized request loader that reads URLs from XML sitemaps. It's particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The `SitemapRequestLoader` provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
+The `SitemapRequestLoader` is a specialized request loader that reads URLs from sitemaps following the [Sitemaps protocol](https://www.sitemaps.org/protocol.html). It supports both XML and plain text sitemap formats. It's particularly useful when you want to crawl a website systematically by following its sitemap structure.
+
+:::note
+The `SitemapRequestLoader` is designed specifically for sitemaps that follow the standard Sitemaps protocol. HTML pages containing links are not supported by this loader - those should be handled by regular crawlers using the `enqueue_links` functionality.
+:::
+
+The loader supports filtering URLs using glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The `SitemapRequestLoader` provides streaming processing of sitemaps, ensuring efficient memory usage without loading the entire sitemap into memory.
 
 {SitemapExample}
diff --git a/src/crawlee/request_loaders/_sitemap_request_loader.py b/src/crawlee/request_loaders/_sitemap_request_loader.py
index c220e26402..afec4d4361 100644
--- a/src/crawlee/request_loaders/_sitemap_request_loader.py
+++ b/src/crawlee/request_loaders/_sitemap_request_loader.py
@@ -90,6 +90,11 @@ class SitemapRequestLoaderState(BaseModel):
 class SitemapRequestLoader(RequestLoader):
     """A request loader that reads URLs from sitemap(s).
 
+    The loader is designed to handle sitemaps that follow the format described in the Sitemaps protocol
+    (https://www.sitemaps.org/protocol.html). It supports both XML and plain text sitemap formats.
+    Note that HTML pages containing links are not supported - those should be handled by regular crawlers
+    and the `enqueue_links` functionality.
+
     The loader fetches and parses sitemaps in the background, allowing crawling to start
     before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
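
For reviewers, a minimal usage sketch of the loader this PR documents. The constructor signature is not part of this diff, so the `sitemap_urls`, `http_client`, and `include` parameter names below are assumptions for illustration only; `fetch_next_request()` and `mark_request_as_handled()` are the `RequestLoader` interface methods referenced in the architecture overview.

```python
import asyncio
import re

from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # HTTP client used to download the sitemap(s); assumed here, any Crawlee HTTP client should work.
    http_client = HttpxHttpClient()

    # Assumed constructor parameters: a list of sitemap URLs plus optional include/exclude filters
    # (glob patterns or compiled regexes, per the guide text in this PR).
    loader = SitemapRequestLoader(
        sitemap_urls=['https://crawlee.dev/sitemap.xml'],
        http_client=http_client,
        include=[re.compile(r'.*/docs/.*')],
    )

    # Consume requests through the RequestLoader interface: fetch, process, mark as handled.
    while request := await loader.fetch_next_request():
        print(f'Processing {request.url}')
        await loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```

Because the loader parses sitemaps in the background, the loop above can start consuming URLs before the full sitemap tree has been fetched, which is the streaming behavior the guide change describes.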