community: Spider integration (langchain-ai#20937)
Added the [Spider.cloud](https://spider.cloud) document loader.
[Spider](https://github.com/spider-rs/spider) is the
[fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)
and cheapest crawler that returns LLM-ready data.

```
- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren 
```

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: = <=>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
4 people committed Apr 27, 2024
1 parent 6342217 commit 804390b
Showing 6 changed files with 223 additions and 1 deletion.
95 changes: 95 additions & 0 deletions docs/docs/integrations/document_loaders/spider.ipynb
@@ -0,0 +1,95 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spider\n",
"[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.\n",
"\n",
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pip install spider-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"To use spider you need to have an API key from [spider.cloud](https://spider.cloud/)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]\n"
]
}
],
"source": [
"from langchain_community.document_loaders import SpiderLoader\n",
"\n",
"loader = SpiderLoader(\n",
" api_key=\"YOUR_API_KEY\",\n",
" url=\"https://spider.cloud\",\n",
" mode=\"scrape\", # if no API key is provided it looks for SPIDER_API_KEY in env\n",
")\n",
"\n",
"data = loader.load()\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modes\n",
"- `scrape`: Default mode that scrapes a single URL\n",
"- `crawl`: Crawl all subpages of the domain url provided"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crawler options\n",
"The `params` parameter is a dictionary that can be passed to the loader. See the [Spider documentation](https://spider.cloud/docs/api) to see all available parameters"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
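For a concrete picture of the `params` option described in the notebook above, here is a minimal sketch of passing crawl options through the loader. The keys shown (`return_format`, `limit`) are illustrative assumptions based on the loader's default and Spider's own example request; check the [Spider API docs](https://spider.cloud/docs/api) for the authoritative parameter list.

```python
from langchain_community.document_loaders import SpiderLoader

# Minimal sketch: crawl mode with extra options forwarded to the Spider API.
# `return_format` mirrors the loader's default; `limit` is an assumed cap on
# pages taken from Spider's example request -- verify both against
# https://spider.cloud/docs/api before relying on them.
loader = SpiderLoader(
    url="https://spider.cloud",
    mode="crawl",
    params={"return_format": "markdown", "limit": 5},
    # api_key omitted: the loader falls back to the SPIDER_API_KEY env var
)

docs = loader.load()
print(f"Loaded {len(docs)} documents")
```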
2 changes: 1 addition & 1 deletion docs/docs/integrations/document_loaders/web_base.ipynb
@@ -9,7 +9,7 @@
"\n",
"This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`. \n",
"\n",
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader`.\n"
"If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.\n"
]
},
{
26 changes: 26 additions & 0 deletions docs/docs/modules/data_connection/document_loaders/html.mdx
@@ -55,6 +55,32 @@ data

</CodeOutputBlock>

## Loading HTML with SpiderLoader

[Spider](https://spider.cloud/?ref=langchain) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.

Spider also supports high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.

## Prerequisite

You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud).

```python
%pip install --upgrade --quiet langchain langchain-community spider-client
```
```python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)

data = loader.load()
```
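
Each returned item is a standard LangChain `Document`, so the usual `page_content` and `metadata` access applies. A small sketch, assuming the `data` variable from the example above; the metadata keys shown (`title`, `url`) come from an example Spider response and may vary per page and crawl settings:

```python
for doc in data:
    # Use .get() since metadata keys can vary per page
    print(doc.metadata.get("title"), "->", doc.metadata.get("url"))
    print(doc.page_content[:200])  # preview the first 200 characters
```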

For guides and documentation, visit [Spider](https://spider.cloud/docs/api).


## Loading HTML with FireCrawlLoader

[FireCrawl](https://firecrawl.dev/?ref=langchain) crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.
@@ -14,6 +14,7 @@
Document, <name>TextSplitter
"""

import importlib
from typing import TYPE_CHECKING, Any

@@ -409,6 +410,9 @@
from langchain_community.document_loaders.snowflake_loader import (
SnowflakeLoader, # noqa: F401
)
from langchain_community.document_loaders.spider import (
SpiderLoader, # noqa: F401
)
from langchain_community.document_loaders.spreedly import (
SpreedlyLoader, # noqa: F401
)
@@ -647,6 +651,7 @@
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",
@@ -836,6 +841,7 @@
"SitemapLoader": "langchain_community.document_loaders.sitemap",
"SlackDirectoryLoader": "langchain_community.document_loaders.slack_directory",
"SnowflakeLoader": "langchain_community.document_loaders.snowflake_loader",
"SpiderLoader": "langchain_community.document_loaders.spider",
"SpreedlyLoader": "langchain_community.document_loaders.spreedly",
"StripeLoader": "langchain_community.document_loaders.stripe",
"SurrealDBLoader": "langchain_community.document_loaders.surrealdb",
94 changes: 94 additions & 0 deletions libs/community/langchain_community/document_loaders/spider.py
@@ -0,0 +1,94 @@
from typing import Iterator, Literal, Optional

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_core.utils import get_from_env


class SpiderLoader(BaseLoader):
    """Load web pages as Documents using Spider AI.

    Must have the Python package `spider-client` installed and a Spider API key.
    See https://spider.cloud for more.
    """

    def __init__(
        self,
        url: str,
        *,
        api_key: Optional[str] = None,
        mode: Literal["scrape", "crawl"] = "scrape",
        params: Optional[dict] = {"return_format": "markdown"},
    ):
        """Initialize with API key and URL.

        Args:
            url: The URL to be processed.
            api_key: The Spider API key. If not specified, will be read from env
                var `SPIDER_API_KEY`.
            mode: The mode to run the loader in. Default is "scrape".
                Options include "scrape" (single page) and "crawl" (with deeper
                crawling following subpages).
            params: Additional parameters for the Spider API.
        """
        try:
            from spider import Spider  # noqa: F401
        except ImportError:
            raise ImportError(
                "`spider` package not found, please run `pip install spider-client`"
            )
        if mode not in ("scrape", "crawl"):
            raise ValueError(
                f"Unrecognized mode '{mode}'. Expected one of 'scrape', 'crawl'."
            )
        # Copy `params` (treating `None` as empty) so the mutable default dict
        # in the signature is never mutated across loader instances
        params = dict(params) if params is not None else {}

        # Request page metadata by default unless the caller overrides it
        params.setdefault("metadata", True)

        # Use the environment variable if the API key isn't provided
        api_key = api_key or get_from_env("api_key", "SPIDER_API_KEY")
        self.spider = Spider(api_key=api_key)
        self.url = url
        self.mode = mode
        self.params = params

    def lazy_load(self) -> Iterator[Document]:
        """Load documents based on the specified mode."""
        spider_docs = []

        if self.mode == "scrape":
            # Scrape a single page
            response = self.spider.scrape_url(self.url, params=self.params)
            if response:
                spider_docs.append(response)
        elif self.mode == "crawl":
            # Crawl multiple pages
            response = self.spider.crawl_url(self.url, params=self.params)
            if response:
                spider_docs.extend(response)

        for doc in spider_docs:
            if self.mode == "scrape":
                # Ensure page_content is also not None
                page_content = doc[0].get("content", "")

                # Ensure metadata is also not None
                metadata = doc[0].get("metadata", {})

                yield Document(page_content=page_content, metadata=metadata)
            if self.mode == "crawl":
                # Ensure page_content is also not None
                page_content = doc.get("content", "")

                # Ensure metadata is also not None
                metadata = doc.get("metadata", {})

                if page_content is not None:
                    yield Document(
                        page_content=page_content,
                        metadata=metadata,
                    )
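
Because the loader implements `lazy_load`, documents are yielded one at a time (the underlying API response is still fetched in full before iteration), while `load()` inherited from `BaseLoader` simply collects that iterator into a list. A minimal usage sketch, assuming `SPIDER_API_KEY` is set in the environment:

```python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(url="https://spider.cloud", mode="crawl")

# Iterate lazily over the yielded Documents instead of building a list with load()
for doc in loader.lazy_load():
    print(doc.metadata.get("url"), len(doc.page_content))
```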
@@ -143,6 +143,7 @@
"SitemapLoader",
"SlackDirectoryLoader",
"SnowflakeLoader",
"SpiderLoader",
"SpreedlyLoader",
"StripeLoader",
"SurrealDBLoader",