diff --git a/README.md b/README.md index 5bdcab0ebe..cf26a294a7 100644 --- a/README.md +++ b/README.md @@ -1 +1,411 @@ -# Crawlee Python +

+# Crawlee
+
+A web scraping and browser automation library

+
+Crawlee covers your crawling and scraping end-to-end and **helps you build reliable scrapers. Fast.**
+
+Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.
+
+Crawlee is also available in a TypeScript implementation, which you can explore and use for your projects. Visit [Crawlee on GitHub](https://github.com/apify/crawlee) for more information.
+
+## Installation
+
+Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package.
+
+```
+pip install crawlee
+```
+
+## Features
+
+- Unified interface for **HTTP and headless browser** crawling.
+- Persistent **queue** for URLs to crawl (breadth & depth-first).
+- Pluggable **storage** of both tabular data and files.
+- Automatic **scaling** with available system resources.
+- Integrated **proxy rotation** and session management.
+- Configurable **request routing** - direct URLs to the appropriate handlers.
+- Robust **error handling**.
+- Automatic **retries** when getting blocked.
+- Written in Python with **type hints**, which means better DX and fewer bugs.
+
+## Introduction
+
+Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.
+
+Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.
+
+### Crawlers
+
+Crawlee offers a framework for parallel web crawling through a variety of crawler classes, each designed to meet different crawling needs.
+
+#### HttpCrawler
+
+[`HttpCrawler`](https://github.com/apify/crawlee-py/tree/master/src/crawlee/http_crawler) provides a framework for the parallel crawling of web pages using plain HTTP requests. The URLs to crawl are fed to it by a request provider, enabling the recursive crawling of websites. Parsing of the obtained HTML is the user's responsibility.
+
+Since `HttpCrawler` uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display its content, you might need to use a browser-based crawler instead, e.g. `PlaywrightCrawler`, because it loads the pages using a full-featured headless browser.
+
+`HttpCrawler` downloads each URL using a plain HTTP request, obtains the response, and then invokes the user-provided request handler to extract page data.
+
+The source URLs are represented by [`Request`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/models.py) objects that are fed from [`RequestList`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/storages/request_list.py) or [`RequestQueue`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/storages/request_queue.py) instances supplied via the request provider option.
+
+The crawler finishes when there are no more `Request` objects left to crawl.
+
+If you want to parse data using [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), see the `BeautifulSoupCrawler` section.
+
+Example usage:
+
+```python
+import asyncio
+
+from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    # Open a default request queue and add requests to it
+    rq = await RequestQueue.open()
+    await rq.add_request('https://crawlee.dev')
+
+    # Open a default dataset for storing results
+    dataset = await Dataset.open()
+
+    # Create an HttpCrawler instance and provide a request provider
+    crawler = HttpCrawler(request_provider=rq)
+
+    # Define a handler for processing requests
+    @crawler.router.default_handler
+    async def request_handler(context: HttpCrawlingContext) -> None:
+        # The crawler provides an HttpCrawlingContext instance, from which you can access
+        # the request and response data
+        record = {
+            'url': context.request.url,
+            'status_code': context.http_response.status_code,
+            'headers': dict(context.http_response.headers),
+            'response': context.http_response.read().decode()[:1000],
+        }
+        # Push the record to the dataset
+        await dataset.push_data(record)
+
+    # Run the crawler
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
+
+For further explanation of storages (dataset, request queue), see the Storages section.
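+
+Because `HttpCrawler` leaves HTML parsing up to you, the request handler is also the place to extract whatever fields you need. The following is a minimal sketch that reuses the calls from the example above and pulls the page title out of the raw HTML with a standard-library regular expression.
+
+```python
+import asyncio
+import re
+
+from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    rq = await RequestQueue.open()
+    await rq.add_request('https://crawlee.dev')
+
+    dataset = await Dataset.open()
+    crawler = HttpCrawler(request_provider=rq)
+
+    @crawler.router.default_handler
+    async def request_handler(context: HttpCrawlingContext) -> None:
+        # HttpCrawler does not parse the response, so decode the body manually
+        html = context.http_response.read().decode()
+
+        # Quick-and-dirty title extraction - fine for a sketch, not for production parsing
+        match = re.search(r'<title[^>]*>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
+        title = match.group(1).strip() if match else ''
+
+        await dataset.push_data({'url': context.request.url, 'title': title})
+
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
+
+In practice, you would typically reach for `BeautifulSoupCrawler` (described next) instead of hand-rolling the HTML parsing.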
+
+#### BeautifulSoupCrawler
+
+[`BeautifulSoupCrawler`](https://github.com/apify/crawlee-py/tree/master/src/crawlee/beautifulsoup_crawler) extends `HttpCrawler`. It provides the same features and, on top of that, uses the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) HTML parser.
+
+As with `HttpCrawler`, `BeautifulSoupCrawler` uses raw HTTP requests to download web pages, so it is very fast and efficient on data bandwidth. However, if the target website requires JavaScript to display its content, you might need to use `PlaywrightCrawler` instead, because it loads the pages using a full-featured headless browser (Chrome, Firefox or others).
+
+`BeautifulSoupCrawler` downloads each URL using a plain HTTP request, parses the HTML content using BeautifulSoup, and then invokes the user-provided request handler to extract page data using an interface to the parsed HTML DOM.
+
+Example usage:
+
+```python
+import asyncio
+
+from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    # Open a default request queue and add requests to it
+    rq = await RequestQueue.open()
+    await rq.add_request('https://crawlee.dev')
+
+    # Open a default dataset for storing results
+    dataset = await Dataset.open()
+
+    # Create a BeautifulSoupCrawler instance and provide a request provider
+    crawler = BeautifulSoupCrawler(request_provider=rq)
+
+    # Define a handler for processing requests
+    @crawler.router.default_handler
+    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
+        # The crawler provides a BeautifulSoupCrawlingContext instance, from which you can access
+        # the request and response data
+        record = {
+            'title': context.soup.title.text if context.soup.title else '',
+            'url': context.request.url,
+        }
+        # Push the record to the dataset
+        await dataset.push_data(record)
+
+    # Run the crawler
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
+
+`BeautifulSoupCrawler` also provides a helper for enqueuing links found on the currently crawled page. See the following example with an updated request handler:
+
+```python
+    @crawler.router.default_handler
+    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
+        # Use the enqueue_links helper to enqueue all links from the page that stay on the same domain
+        await context.enqueue_links(strategy=EnqueueStrategy.SAME_DOMAIN)
+        record = {
+            'title': context.soup.title.text if context.soup.title else '',
+            'url': context.request.url,
+        }
+        await dataset.push_data(record)
+```
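+
+Putting the pieces together, a complete same-domain crawl might look like the sketch below. Note that the snippet above uses `EnqueueStrategy`, which has to be imported; the import path shown here is an assumption - check the package for its actual location in the version you have installed.
+
+```python
+import asyncio
+
+from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+from crawlee.enqueue_strategy import EnqueueStrategy  # assumed import path
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    # Seed the default request queue with a single start URL
+    rq = await RequestQueue.open()
+    await rq.add_request('https://crawlee.dev')
+
+    dataset = await Dataset.open()
+    crawler = BeautifulSoupCrawler(request_provider=rq)
+
+    @crawler.router.default_handler
+    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
+        # Enqueue all links from the page that stay on the same domain
+        await context.enqueue_links(strategy=EnqueueStrategy.SAME_DOMAIN)
+        await dataset.push_data({
+            'title': context.soup.title.text if context.soup.title else '',
+            'url': context.request.url,
+        })
+
+    # The crawler keeps going until the queue runs out of requests
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```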
+
+#### PlaywrightCrawler
+
+- TODO
+
+### Storages
+
+Crawlee introduces several result storage types, each useful for a specific task. The actual storing of the underlying data is handled by a storage client. Currently, only a memory storage client is implemented; it keeps the data in memory and can also persist them to disk.
+
+By default, the data are stored in the directory specified by the `CRAWLEE_STORAGE_DIR` environment variable, which defaults to `.storage/`.
+
+#### Dataset
+
+A [`Dataset`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/storages/dataset.py) is a type of storage mainly suitable for storing tabular data.
+
+Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The dataset can be imagined as a table, where each object is a row and its attributes are columns. The dataset is an append-only storage - we can only add new records to it, but we cannot modify or remove existing records.
+
+Each Crawlee project run is associated with a default dataset. Typically, it is used to store crawling results specific to the crawler run. Its usage is optional.
+
+The data are persisted as follows:
+
+```
+{CRAWLEE_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
+```
+
+The following code demonstrates the basic operations of the dataset:
+
+```python
+import asyncio
+
+from crawlee.storages import Dataset
+
+
+async def main() -> None:
+    # Open a default dataset
+    dataset = await Dataset.open()
+
+    # Push a single record
+    await dataset.push_data({'key1': 'value1'})
+
+    # Get records from the dataset
+    data = await dataset.get_data()
+    print(f'Dataset data: {data.items}')  # Dataset data: [{'key1': 'value1'}]
+
+    # Open a named dataset
+    dataset_named = await Dataset.open('some-name')
+
+    # Push multiple records
+    await dataset_named.push_data([{'key2': 'value2'}, {'key3': 'value3'}])
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
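+
+Because the dataset is append-only, any aggregation or post-processing happens after the records are read back. The following is a small sketch, built only from the calls shown above, that pushes a few records and computes a simple summary.
+
+```python
+import asyncio
+
+from crawlee.storages import Dataset
+
+
+async def main() -> None:
+    dataset = await Dataset.open()
+    await dataset.push_data([{'price': 10}, {'price': 25}, {'price': 7}])
+
+    # Existing records cannot be modified in place, so read them back for post-processing
+    data = await dataset.get_data()
+    total = sum(item['price'] for item in data.items)
+    print(f'Stored {len(data.items)} records, total price: {total}')
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```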
+
+#### Key-value store
+
+The [`KeyValueStore`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/storages/key_value_store.py) is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs, or for persisting the state of crawlers.
+
+Each Crawlee project run is associated with a default key-value store. By convention, the project input and output are stored in the default key-value store under the `INPUT` and `OUTPUT` keys, respectively. Typically, both input and output are `JSON` files, although they can be of any other format.
+
+The data are persisted as follows:
+
+```
+{CRAWLEE_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT}
+```
+
+The following code demonstrates the basic operations of key-value stores:
+
+```python
+import asyncio
+
+from crawlee.storages import KeyValueStore
+
+
+async def main() -> None:
+    # Open a default key-value store
+    kvs = await KeyValueStore.open()
+
+    # Write the OUTPUT to the default key-value store
+    await kvs.set_value('OUTPUT', {'my_result': 123})
+
+    # Read the OUTPUT from the default key-value store
+    value = await kvs.get_value('OUTPUT')
+    print(f'Value of OUTPUT: {value}')  # Value of OUTPUT: {'my_result': 123}
+
+    # Open a named key-value store
+    kvs_named = await KeyValueStore.open('some-name')
+
+    # Write a record to the named key-value store
+    await kvs_named.set_value('some-key', {'foo': 'bar'})
+
+    # Delete a record from the named key-value store
+    await kvs_named.set_value('some-key', None)
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
+
+#### Request queue
+
+The [`RequestQueue`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/storages/request_queue.py) is a storage of URLs (requests) to crawl. The queue is used for the deep crawling of websites, where we start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.
+
+Each Crawlee project run is associated with a default request queue. Typically, it is used to store the URLs to crawl in that specific crawler run. Its usage is optional.
+
+The data are persisted as follows:
+
+```
+{CRAWLEE_STORAGE_DIR}/request_queues/{QUEUE_ID}/entries.json
+```
+
+The following code demonstrates the basic usage of the request queue:
+
+```python
+import asyncio
+
+from crawlee.storages import RequestQueue
+
+
+async def main() -> None:
+    # Open a default request queue
+    rq = await RequestQueue.open()
+
+    # Add a single request
+    await rq.add_request('https://crawlee.dev')
+
+    # Open a named request queue
+    rq_named = await RequestQueue.open('some-name')
+
+    # Add multiple requests
+    await rq_named.add_requests_batched(['https://apify.com', 'https://example.com'])
+
+    # Fetch the next request
+    request = await rq_named.fetch_next_request()
+    print(f'Next request: {request.url}')  # Next request: https://apify.com
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```
+
+For an example of using the request queue together with a crawler, see the `BeautifulSoupCrawler` example above or the short sketch below.
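+
+For instance, the batched helper can be used to seed the default queue with several start URLs before handing it over to a crawler. The following sketch is built only from the calls shown earlier in this README and simply records the status code of each response.
+
+```python
+import asyncio
+
+from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
+from crawlee.storages import Dataset, RequestQueue
+
+
+async def main() -> None:
+    # Seed the default request queue with several start URLs in one call
+    rq = await RequestQueue.open()
+    await rq.add_requests_batched(['https://crawlee.dev', 'https://apify.com', 'https://example.com'])
+
+    dataset = await Dataset.open()
+    crawler = HttpCrawler(request_provider=rq)
+
+    @crawler.router.default_handler
+    async def request_handler(context: HttpCrawlingContext) -> None:
+        # Store only the URL and the HTTP status code for each crawled page
+        await dataset.push_data({
+            'url': context.request.url,
+            'status_code': context.http_response.status_code,
+        })
+
+    # The crawler finishes once the queue is empty
+    await crawler.run()
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
+```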
+
+### Session Management
+
+[`SessionPool`](https://github.com/apify/crawlee-py/blob/master/src/crawlee/sessions/session_pool.py) is a class that allows us to handle the rotation of proxy IP addresses along with cookies and other custom settings in Crawlee.
+
+The main benefit of using a session pool is that we can filter out blocked or non-working proxies, so our crawler does not retry requests over proxies that are known to be blocked or non-working. Another benefit is that we can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Having our cookies and other identifiers used only with a specific IP reduces the chance of being blocked. Last but not least, the session pool rotates IP addresses evenly - it picks sessions at random, which should prevent a small pool of available IPs from burning out.
+
+To use the default session pool with automatic session rotation, enable the `use_session_pool` option on the crawler.
+
+```python
+from crawlee.http_crawler import HttpCrawler
+
+crawler = HttpCrawler(use_session_pool=True)
+```
+
+If you want to configure your own session pool, instantiate it and provide it directly to the crawler.
+
+```python
+from datetime import timedelta
+
+from crawlee.http_crawler import HttpCrawler
+from crawlee.sessions import Session, SessionPool
+
+# Use a dict of keyword arguments for newly created sessions
+session_pool_v1 = SessionPool(
+    max_pool_size=10,
+    create_session_settings={'max_age': timedelta(minutes=10)},
+)
+
+# Or use a factory function to create new sessions
+session_pool_v2 = SessionPool(
+    max_pool_size=10,
+    create_session_function=lambda _: Session(max_age=timedelta(minutes=10)),
+)
+
+crawler = HttpCrawler(session_pool=session_pool_v1, use_session_pool=True)
+```
+
+## Running on the Apify platform
+
+Crawlee is open-source and runs anywhere, but since it's developed by [Apify](https://apify.com), it's easy to set up on the Apify platform and run in the cloud. Visit the [Apify SDK website](https://sdk.apify.com) to learn more about deploying Crawlee to the Apify platform.
+
+## Support
+
+If you find any bug or issue with Crawlee, please [submit an issue on GitHub](https://github.com/apify/crawlee-py/issues). For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify), in GitHub Discussions, or join our [Discord server](https://discord.com/invite/jyEM2PRvMU).
+
+## Contributing
+
+Your code contributions are welcome, and you'll be praised for eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/crawlee-py/blob/master/CONTRIBUTING.md).
+
+## License
+
+This project is licensed under the Apache License 2.0 - see the [LICENSE.md](https://github.com/apify/crawlee-py/blob/master/LICENSE.md) file for details.
diff --git a/src/crawlee/basic_crawler/basic_crawler.py b/src/crawlee/basic_crawler/basic_crawler.py index 8b36251811..7dfd5fcfd4 100644 --- a/src/crawlee/basic_crawler/basic_crawler.py +++ b/src/crawlee/basic_crawler/basic_crawler.py @@ -161,7 +161,7 @@ def __init__( ) self._use_session_pool = use_session_pool - self._session_pool: SessionPool = session_pool or SessionPool() + self._session_pool = session_pool or SessionPool() self._retry_on_blocked = retry_on_blocked diff --git a/src/crawlee/http_crawler/__init__.py b/src/crawlee/http_crawler/__init__.py index 5ac80ed71a..18256a3e2d 100644 --- a/src/crawlee/http_crawler/__init__.py +++ b/src/crawlee/http_crawler/__init__.py @@ -1 +1 @@ -from .http_crawler import HttpCrawler +from .http_crawler import HttpCrawler, HttpCrawlingContext diff --git a/src/crawlee/models.py b/src/crawlee/models.py index 5a66566327..a184a949b7 100644 --- a/src/crawlee/models.py +++ b/src/crawlee/models.py @@ -107,7 +107,9 @@ def from_url( @classmethod def from_base_request_data(cls, base_request_data: BaseRequestData, *, id: str | None = None) -> Self: """Create a complete Request object based on a BaseRequestData instance.""" - return cls(**base_request_data.model_dump(), id=id or unique_key_to_request_id(base_request_data.unique_key)) + kwargs = base_request_data.model_dump() + kwargs['id'] = id or unique_key_to_request_id(base_request_data.unique_key) + return cls(**kwargs) @property def label(self) -> str | None: diff --git a/src/crawlee/storages/request_list.py b/src/crawlee/storages/request_list.py index 0f36534738..07605ba737 100644 --- a/src/crawlee/storages/request_list.py +++ b/src/crawlee/storages/request_list.py @@ -2,6 +2,7 @@ from collections import deque from datetime import timedelta +from typing import Sequence from typing_extensions import override @@ -76,12 +77,20 @@ async def get_handled_count(self) -> int: @override async def add_requests_batched( self, - requests: list[BaseRequestData | Request], + requests: Sequence[BaseRequestData | Request | str], *, batch_size: int = 1000, wait_for_all_requests_to_be_added: bool = False, wait_time_between_batches: timedelta = timedelta(seconds=1), ) -> None: - self._sources.extend( - request if isinstance(request, Request) else Request.from_base_request_data(request) for request in requests - ) + batch = [] + + for request in requests: + if isinstance(request, Request): + batch.append(request) + elif isinstance(request, BaseRequestData): + batch.append(Request.from_base_request_data(request)) + else: + batch.append(Request.from_url(request)) + + self._sources.extend(batch) diff --git a/src/crawlee/storages/request_provider.py b/src/crawlee/storages/request_provider.py index b2d49620a0..fdbb01c2e8 100644 --- a/src/crawlee/storages/request_provider.py +++ b/src/crawlee/storages/request_provider.py @@ -2,7 +2,7 @@ from abc import ABC, abstractmethod from datetime import timedelta -from typing import TYPE_CHECKING +from typing import TYPE_CHECKING, Sequence if TYPE_CHECKING: from crawlee.models import BaseRequestData, Request, RequestQueueOperationInfo @@ -54,7 +54,7 @@ async def get_handled_count(self) -> int: @abstractmethod async def add_requests_batched( self, - requests: list[BaseRequestData | Request], + requests: Sequence[BaseRequestData | Request | str], *, batch_size: int = 1000, wait_for_all_requests_to_be_added: bool = False, diff --git a/src/crawlee/storages/request_queue.py b/src/crawlee/storages/request_queue.py index 0bce4eef8c..d0a0d396f7 100644 --- a/src/crawlee/storages/request_queue.py 
+++ b/src/crawlee/storages/request_queue.py @@ -4,7 +4,7 @@ from collections import OrderedDict from datetime import datetime, timedelta, timezone from logging import getLogger -from typing import TYPE_CHECKING +from typing import TYPE_CHECKING, Sequence from typing import OrderedDict as OrderedDictType from typing_extensions import override @@ -13,14 +13,14 @@ from crawlee._utils.lru_cache import LRUCache from crawlee._utils.requests import unique_key_to_request_id from crawlee.consts import REQUEST_QUEUE_LABEL -from crawlee.models import Request, RequestQueueHeadState, RequestQueueOperationInfo +from crawlee.models import BaseRequestData, Request, RequestQueueHeadState, RequestQueueOperationInfo from crawlee.storages.base_storage import BaseStorage from crawlee.storages.request_provider import RequestProvider if TYPE_CHECKING: from crawlee.base_storage_client import BaseStorageClient from crawlee.configuration import Configuration - from crawlee.models import BaseRequestData, BaseStorageMetadata + from crawlee.models import BaseStorageMetadata logger = getLogger(__name__) @@ -135,7 +135,7 @@ async def drop(self) -> None: async def add_request( self, - request: Request, + request: Request | BaseRequestData | str, *, forefront: bool = False, ) -> RequestQueueOperationInfo: @@ -169,6 +169,11 @@ async def add_request( - `wasAlreadyPresent` (bool): Indicates whether the request was already in the queue. - `wasAlreadyHandled` (bool): Indicates whether the request was already processed. """ + if isinstance(request, BaseRequestData): + request = Request.from_base_request_data(request) + elif isinstance(request, str): + request = Request.from_url(request) + self._last_activity = datetime.now(timezone.utc) cache_key = unique_key_to_request_id(request.unique_key) @@ -207,17 +212,14 @@ async def add_request( @override async def add_requests_batched( self, - requests: list[BaseRequestData | Request], + requests: Sequence[BaseRequestData | Request | str], *, batch_size: int = 1000, wait_for_all_requests_to_be_added: bool = False, wait_time_between_batches: timedelta = timedelta(seconds=1), ) -> None: for request in requests: - if isinstance(request, Request): - await self.add_request(request) - else: - await self.add_request(Request.from_base_request_data(request)) + await self.add_request(request) async def get_request(self, request_id: str) -> Request | None: """Retrieve a request from the queue.