community: Spider integration (langchain-ai#20937)
Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

```
- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren 
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: = <=>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
4 people committed Apr 27, 2024
1 parent 6342217 commit 804390b
Showing 6 changed files with 223 additions and 1 deletion.
95 changes: 95 additions & 0 deletions docs/docs/integrations/document_loaders/spider.ipynb
@@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spider\n",
"[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
"\n",
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"To use spider you need to have an API key from [spider.cloud](https://spider.cloud/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
]
}
],
"source": [
"from langchain_community.document_loaders import SpiderLoader\n",
"\n",
"loader = SpiderLoader(\n",
" api_key=\"YOUR_API_KEY\",\n",
" url=\"https://spider.cloud\",\n",
" mode=\"scrape\", # if no API key is provided it looks for SPIDER_API_KEY in env\n",
")\n",
"\n",
"data = loader.load()\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modes\n",
"- `scrape`: Default mode that scrapes a single URL\n",
"- `crawl`: Crawl all subpages of the domain url provided"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawler options\n",
"The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) to see all available parameters"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
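For a concrete picture of the `params` option described in the notebook above, here is a minimal sketch of passing crawl options through the loader. The keys shown (`return_format`, `limit`) are illustrative assumptions based on the loader's default and Spider's own example request; check the [Spider API docs](https://spider.cloud/docs/api) for the authoritative parameter list.

```python
from langchain_community.document_loaders import SpiderLoader

# Minimal sketch: crawl mode with extra options forwarded to the Spider API.
# `return_format` mirrors the loader's default; `limit` is an assumed cap on
# pages taken from Spider's example request -- verify both against
# https://spider.cloud/docs/api before relying on them.
loader = SpiderLoader(
    url="https://spider.cloud",
    mode="crawl",
    params={"return_format": "markdown", "limit": 5},
    # api_key omitted: the loader falls back to the SPIDER_API_KEY env var
)

docs = loader.load()
print(f"Loaded {len(docs)} documents")
```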
2 changes: 1 addition & 1 deletion docs/docs/integrations/document_loaders/web_base.ipynb
@@ -9,7 +9,7 @@
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
]
},
{
26 changes: 26 additions & 0 deletions docs/docs/modules/data_connection/document_loaders/html.mdx
@@ -55,6 +55,32 @@ data

</CodeOutputBlock>

## Loading HTML with SpiderLoader

[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.

Spider also supports high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.

## Prerequisite

You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).

```python
%pip install --upgrade --quiet langchain langchain-community spider-client
```
```python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)

data = loader.load()
```
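
Each returned item is a standard LangChain `Document`, so the usual `page_content` and `metadata` access applies. A small sketch, assuming the `data` variable from the example above; the metadata keys shown (`title`, `url`) come from an example Spider response and may vary per page and crawl settings:

```python
for doc in data:
    # Use .get() since metadata keys can vary per page
    print(doc.metadata.get("title"), "->", doc.metadata.get("url"))
    print(doc.page_content[:200])  # preview the first 200 characters
```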

For guides and documentation, visit [Spider](https://spider.cloud/docs/api).


## Loading HTML with FireCrawlLoader

[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.
@@ -14,6 +14,7 @@
Document, <name>TextSplitter
"""

import importlib
from typing import TYPE_CHECKING, Any

@@ -409,6 +410,9 @@
from langchain_community.document_loaders.snowflake_loader import (
SnowflakeLoader, # noqa: F401
)
from langchain_community.document_loaders.spider import (
SpiderLoader, # noqa: F401
)
from langchain_community.document_loaders.spreedly import (
SpreedlyLoader, # noqa: F401
)
@@ -647,6 +651,7 @@
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",
@@ -836,6 +841,7 @@
"SitemapLoader": "langchain_community.document_loaders.sitemap",
"SlackDirectoryLoader": "langchain_community.document_loaders.slack_directory",
"SnowflakeLoader": "langchain_community.document_loaders.snowflake_loader",
"SpiderLoader": "langchain_community.document_loaders.spider",
"SpreedlyLoader": "langchain_community.document_loaders.spreedly",
"StripeLoader": "langchain_community.document_loaders.stripe",
"SurrealDBLoader": "langchain_community.document_loaders.surrealdb",
94 changes: 94 additions & 0 deletions libs/community/langchain_community/document_loaders/spider.py
@@ -0,0 +1,94 @@
from typing import Iterator, Literal, Optional

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_core.utils import get_from_env


class SpiderLoader(BaseLoader):
    """Load web pages as Documents using Spider AI.

    Must have the Python package `spider-client` installed and a Spider API key.
    See https://spider.cloud for more.
    """

    def __init__(
        self,
        url: str,
        *,
        api_key: Optional[str] = None,
        mode: Literal["scrape", "crawl"] = "scrape",
        params: Optional[dict] = {"return_format": "markdown"},
    ):
        """Initialize with API key and URL.

        Args:
            url: The URL to be processed.
            api_key: The Spider API key. If not specified, will be read from env
                var `SPIDER_API_KEY`.
            mode: The mode to run the loader in. Default is "scrape".
                Options include "scrape" (single page) and "crawl" (with deeper
                crawling following subpages).
            params: Additional parameters for the Spider API.
        """
        try:
            from spider import Spider  # noqa: F401
        except ImportError:
            raise ImportError(
                "`spider` package not found, please run `pip install spider-client`"
            )
        if mode not in ("scrape", "crawl"):
            raise ValueError(
                f"Unrecognized mode '{mode}'. Expected one of 'scrape', 'crawl'."
            )
        # Copy `params` (treating `None` as empty) so the mutable default dict
        # in the signature is never mutated across loader instances
        params = dict(params) if params is not None else {}

        # Request page metadata by default unless the caller overrides it
        params.setdefault("metadata", True)

        # Use the environment variable if the API key isn't provided
        api_key = api_key or get_from_env("api_key", "SPIDER_API_KEY")
        self.spider = Spider(api_key=api_key)
        self.url = url
        self.mode = mode
        self.params = params

    def lazy_load(self) -> Iterator[Document]:
        """Load documents based on the specified mode."""
        spider_docs = []

        if self.mode == "scrape":
            # Scrape a single page
            response = self.spider.scrape_url(self.url, params=self.params)
            if response:
                spider_docs.append(response)
        elif self.mode == "crawl":
            # Crawl multiple pages
            response = self.spider.crawl_url(self.url, params=self.params)
            if response:
                spider_docs.extend(response)

        for doc in spider_docs:
            if self.mode == "scrape":
                # Ensure page_content is also not None
                page_content = doc[0].get("content", "")

                # Ensure metadata is also not None
                metadata = doc[0].get("metadata", {})

                yield Document(page_content=page_content, metadata=metadata)
            if self.mode == "crawl":
                # Ensure page_content is also not None
                page_content = doc.get("content", "")

                # Ensure metadata is also not None
                metadata = doc.get("metadata", {})

                if page_content is not None:
                    yield Document(
                        page_content=page_content,
                        metadata=metadata,
                    )
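
Because the loader implements `lazy_load`, documents are yielded one at a time (the underlying API response is still fetched in full before iteration), while `load()` inherited from `BaseLoader` simply collects that iterator into a list. A minimal usage sketch, assuming `SPIDER_API_KEY` is set in the environment:

```python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(url="https://spider.cloud", mode="crawl")

# Iterate lazily over the yielded Documents instead of building a list with load()
for doc in loader.lazy_load():
    print(doc.metadata.get("url"), len(doc.page_content))
```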
@@ -143,6 +143,7 @@
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",