diff --git a/docs/docs/integrations/document_loaders/spider.ipynb b/docs/docs/integrations/document_loaders/spider.ipynb
new file mode 100644
index 0000000000000..ac132724cb512
--- /dev/null
+++ b/docs/docs/integrations/document_loaders/spider.ipynb
@@ -0,0 +1,95 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Spider\n",
+    "[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
+    "\n",
+    "## Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install spider-client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Usage\n",
+    "To use Spider you need an API key from [spider.cloud](https://spider.cloud/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain_community.document_loaders import SpiderLoader\n",
+    "\n",
+    "loader = SpiderLoader(\n",
+    "    api_key=\"YOUR_API_KEY\",\n",
+    "    url=\"https://spider.cloud\",\n",
+    "    mode=\"scrape\",  # if no API key is provided it looks for SPIDER_API_KEY in env\n",
+    ")\n",
+    "\n",
+    "data = loader.load()\n",
+    "print(data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Modes\n",
+    "- `scrape`: Default mode that scrapes a single URL\n",
+    "- `crawl`: Crawl all subpages of the domain URL provided"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Crawler options\n",
+    "The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) for all available parameters."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/docs/integrations/document_loaders/web_base.ipynb b/docs/docs/integrations/document_loaders/web_base.ipynb
index 5362d8d7021a2..3bd81baf7d0dd 100644
--- a/docs/docs/integrations/document_loaders/web_base.ipynb
+++ b/docs/docs/integrations/document_loaders/web_base.ipynb
@@ -9,7 +9,7 @@
     "\n",
     "This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
     "\n",
-    "If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
+    "If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
    ]
   },
   {
diff --git a/docs/docs/modules/data_connection/document_loaders/html.mdx b/docs/docs/modules/data_connection/document_loaders/html.mdx
index a6d204bd21e44..b995c2c6d0e94 100644
--- a/docs/docs/modules/data_connection/document_loaders/html.mdx
+++ b/docs/docs/modules/data_connection/document_loaders/html.mdx
@@ -55,6 +55,32 @@ data
 
 
+## Loading HTML with SpiderLoader
+
+[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.
+
+Spider also provides high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.
+
+## Prerequisite
+
+You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).
+
+```python
+%pip install --upgrade --quiet langchain langchain-community spider-client
+```
+```python
+from langchain_community.document_loaders import SpiderLoader
+
+loader = SpiderLoader(
+    api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
+)
+
+data = loader.load()
+```
+
+For guides and documentation, visit [Spider](https://spider.cloud/docs/api).
+
+
 ## Loading HTML with FireCrawlLoader
 
 [FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and convert any website into markdown. It crawls all accessible subpages and give you clean markdown and metadata for each.
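The docs above only pass `url`, `api_key`, and `mode`; everything else goes through the `params` dictionary, which the loader forwards to the Spider API. As a rough sketch of crawl mode with custom options — the `limit` and `return_format` keys below mirror Spider's own example request and the loader's defaults, and the authoritative list of keys lives in the Spider API docs:

```python
from langchain_community.document_loaders import SpiderLoader

# Illustrative only: `limit` and `return_format` are taken from the example
# request shown on spider.cloud and the loader's default params; consult
# https://spider.cloud/docs/api for the full set of supported keys.
loader = SpiderLoader(
    url="https://spider.cloud",
    mode="crawl",  # follow subpages of the domain instead of a single scrape
    params={
        "limit": 50,  # cap how many pages the crawl fetches
        "return_format": "markdown",  # ask Spider for LLM-ready markdown
    },
)

# With no api_key argument, the loader reads SPIDER_API_KEY from the environment.
data = loader.load()
print(len(data))
```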
diff --git a/libs/community/langchain_community/document_loaders/__init__.py b/libs/community/langchain_community/document_loaders/__init__.py
index 0f0d511000fad..171a2b9eabe79 100644
--- a/libs/community/langchain_community/document_loaders/__init__.py
+++ b/libs/community/langchain_community/document_loaders/__init__.py
@@ -14,6 +14,7 @@
     Document, TextSplitter
 """
+
 import importlib
 from typing import TYPE_CHECKING, Any
 
@@ -409,6 +410,9 @@
     from langchain_community.document_loaders.snowflake_loader import (
         SnowflakeLoader,  # noqa: F401
     )
+    from langchain_community.document_loaders.spider import (
+        SpiderLoader,  # noqa: F401
+    )
     from langchain_community.document_loaders.spreedly import (
         SpreedlyLoader,  # noqa: F401
     )
@@ -647,6 +651,7 @@
     "SitemapLoader",
     "SlackDirectoryLoader",
     "SnowflakeLoader",
+    "SpiderLoader",
     "SpreedlyLoader",
     "StripeLoader",
     "SurrealDBLoader",
@@ -836,6 +841,7 @@
     "SitemapLoader": "langchain_community.document_loaders.sitemap",
     "SlackDirectoryLoader": "langchain_community.document_loaders.slack_directory",
     "SnowflakeLoader": "langchain_community.document_loaders.snowflake_loader",
+    "SpiderLoader": "langchain_community.document_loaders.spider",
     "SpreedlyLoader": "langchain_community.document_loaders.spreedly",
     "StripeLoader": "langchain_community.document_loaders.stripe",
     "SurrealDBLoader": "langchain_community.document_loaders.surrealdb",
diff --git a/libs/community/langchain_community/document_loaders/spider.py b/libs/community/langchain_community/document_loaders/spider.py
new file mode 100644
index 0000000000000..23d6978165b33
--- /dev/null
+++ b/libs/community/langchain_community/document_loaders/spider.py
@@ -0,0 +1,94 @@
+from typing import Iterator, Literal, Optional
+
+from langchain_core.document_loaders import BaseLoader
+from langchain_core.documents import Document
+from langchain_core.utils import get_from_env
+
+
+class SpiderLoader(BaseLoader):
+    """Load web pages as Documents using Spider AI.
+
+    Must have the Python package `spider-client` installed and a Spider API key.
+    See https://spider.cloud for more.
+    """
+
+    def __init__(
+        self,
+        url: str,
+        *,
+        api_key: Optional[str] = None,
+        mode: Literal["scrape", "crawl"] = "scrape",
+        params: Optional[dict] = None,
+    ):
+        """Initialize with API key and URL.
+
+        Args:
+            url: The URL to be processed.
+            api_key: The Spider API key. If not specified, will be read from env
+                var `SPIDER_API_KEY`.
+            mode: The mode to run the loader in. Default is "scrape".
+                 Options include "scrape" (single page) and "crawl" (with deeper
+                 crawling following subpages).
+            params: Additional parameters for the Spider API. Defaults to markdown.
+        """
+        try:
+            from spider import Spider  # noqa: F401
+        except ImportError:
+            raise ImportError(
+                "`spider` package not found, please run `pip install spider-client`"
+            )
+        if mode not in ("scrape", "crawl"):
+            raise ValueError(
+                f"Unrecognized mode '{mode}'. Expected one of 'scrape', 'crawl'."
+            )
+        # Build the default here rather than using a mutable default argument
+        if params is None:
+            params = {"return_format": "markdown"}
+
+        # Add a default value for 'metadata' if it's not already present
+        if "metadata" not in params:
+            params["metadata"] = True
+
+        # Use the environment variable if the API key isn't provided
+        api_key = api_key or get_from_env("api_key", "SPIDER_API_KEY")
+        self.spider = Spider(api_key=api_key)
+        self.url = url
+        self.mode = mode
+        self.params = params
+
+    def lazy_load(self) -> Iterator[Document]:
+        """Load documents based on the specified mode."""
+        spider_docs = []
+
+        if self.mode == "scrape":
+            # Scrape a single page
+            response = self.spider.scrape_url(self.url, params=self.params)
+            if response:
+                spider_docs.append(response)
+        elif self.mode == "crawl":
+            # Crawl multiple pages
+            response = self.spider.crawl_url(self.url, params=self.params)
+            if response:
+                spider_docs.extend(response)
+
+        for doc in spider_docs:
+            if self.mode == "scrape":
+                # The scrape response is a list holding a single page payload
+                page_content = doc[0].get("content", "")
+
+                # Fall back to empty metadata if the API omits it
+                metadata = doc[0].get("metadata", {})
+
+                yield Document(page_content=page_content, metadata=metadata)
+            elif self.mode == "crawl":
+                # Each crawled page is a dict with its own content and metadata
+                page_content = doc.get("content", "")
+
+                # Fall back to empty metadata if the API omits it
+                metadata = doc.get("metadata", {})
+
+                if page_content is not None:
+                    yield Document(
+                        page_content=page_content,
+                        metadata=metadata,
+                    )
diff --git a/libs/community/tests/unit_tests/document_loaders/test_imports.py b/libs/community/tests/unit_tests/document_loaders/test_imports.py
index dc28a03e68648..cc4ad1a6f9990 100644
--- a/libs/community/tests/unit_tests/document_loaders/test_imports.py
+++ b/libs/community/tests/unit_tests/document_loaders/test_imports.py
@@ -143,6 +143,7 @@
     "SitemapLoader",
     "SlackDirectoryLoader",
     "SnowflakeLoader",
+    "SpiderLoader",
     "SpreedlyLoader",
     "StripeLoader",
     "SurrealDBLoader",
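Because the new loader implements `lazy_load`, pages can also be consumed one at a time instead of collected into a single list with `load()`. A minimal sketch, assuming `SPIDER_API_KEY` is set in the environment (the loader's documented fallback) and that each page's metadata carries the `url` field seen in the scrape output above:

```python
from langchain_community.document_loaders import SpiderLoader

# Assumes SPIDER_API_KEY is exported in the environment; the loader reads it
# when api_key is not passed explicitly.
loader = SpiderLoader(url="https://spider.cloud", mode="crawl")

# lazy_load() yields Documents one at a time, so each crawled page can be
# processed as it arrives rather than after load() has built the full list.
for doc in loader.lazy_load():
    print(doc.metadata.get("url"), len(doc.page_content))
```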