community: Spider integration (langchain-ai#20937)
Added the [Spider.cloud](https://spider.cloud) document loader. [Spider](https://github.com/spider-rs/spider) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and cheapest crawler that returns LLM-ready data.

- **Description:** Adds Spider data loader
- **Dependencies:** spider-client
- **Twitter handle:** @WilliamEspegren

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
Commit 804390b (1 parent: 6342217). 6 changed files with 223 additions and 1 deletion.
# Spider

[Spider](https://spider.cloud/) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md) and most affordable crawler and scraper that returns LLM-ready data.

## Setup

```python
pip install spider-client
```

## Usage

To use Spider you need an API key from [spider.cloud](https://spider.cloud/).

```python
from langchain_community.document_loaders import SpiderLoader

loader = SpiderLoader(
    api_key="YOUR_API_KEY",  # if no API key is provided, the loader reads SPIDER_API_KEY from the environment
    url="https://spider.cloud",
    mode="scrape",
)

data = loader.load()
print(data)
```

````output
[Document(page_content='Spider - Fastest Web Crawler built for AI Agents and Large Language Models[Spider v1 Logo Spider ](/)The World\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \'Authorization\': os.environ["SPIDER_API_KEY"], \'Content-Type\': \'application/json\',}json_data = {"limit":50,"url":"http://www.example.com"}response = requests.post(\'https://api.spider.cloud/crawl\', headers=headers, json=json_data)print(response.json())```Example ResponseScrape with no headaches----------* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM ResponsesThe Fastest Web Crawler----------* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* Cost effectiveScrape Anything with AI----------* Custom scripting browser* Custom data extraction* Data pipelines* Detailed insights* Advanced labeling[API](/docs/api) [Price](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 33743, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler built for AI Agents and Large Language Models', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'})]
````

## Modes

- `scrape`: Default mode that scrapes a single URL.
- `crawl`: Crawls all subpages of the domain URL provided.

## Crawler options

The `params` argument is a dictionary passed through to the Spider API. See the [Spider documentation](https://spider.cloud/docs/api) for all available parameters.
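Before calling the API, `SpiderLoader` fills in a couple of `params` defaults. A minimal standalone sketch of that normalization (an illustrative re-implementation, not a function exported by `spider-client`; the `limit` key is just an example Spider API parameter):

```python
def normalize_params(params=None):
    """Apply SpiderLoader-style defaults to a Spider API params dict (illustrative)."""
    if params is None:
        # Default to markdown output, which is convenient for LLM pipelines
        params = {"return_format": "markdown"}
    # Page metadata is requested unless the caller explicitly opted out
    if "metadata" not in params:
        params["metadata"] = True
    return params

print(normalize_params())
# {'return_format': 'markdown', 'metadata': True}
print(normalize_params({"limit": 50, "metadata": False}))
# {'limit': 50, 'metadata': False}
```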
`libs/community/langchain_community/document_loaders/spider.py` (94 additions, 0 deletions)
```python
from typing import Iterator, Literal, Optional

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from langchain_core.utils import get_from_env


class SpiderLoader(BaseLoader):
    """Load web pages as Documents using Spider AI.

    Must have the Python package `spider-client` installed and a Spider API key.
    See https://spider.cloud for more.
    """

    def __init__(
        self,
        url: str,
        *,
        api_key: Optional[str] = None,
        mode: Literal["scrape", "crawl"] = "scrape",
        params: Optional[dict] = None,
    ):
        """Initialize with API key and URL.

        Args:
            url: The URL to be processed.
            api_key: The Spider API key. If not specified, will be read from env
                var `SPIDER_API_KEY`.
            mode: The mode to run the loader in. Default is "scrape".
                Options include "scrape" (single page) and "crawl" (with deeper
                crawling following subpages).
            params: Additional parameters for the Spider API. Defaults to
                `{"return_format": "markdown"}`.
        """
        try:
            from spider import Spider  # noqa: F401
        except ImportError:
            raise ImportError(
                "`spider` package not found, please run `pip install spider-client`"
            )
        if mode not in ("scrape", "crawl"):
            raise ValueError(
                f"Unrecognized mode '{mode}'. Expected one of 'scrape', 'crawl'."
            )
        # Avoid a mutable default argument; fall back to markdown output
        if params is None:
            params = {"return_format": "markdown"}

        # Request page metadata unless the caller opted out
        if "metadata" not in params:
            params["metadata"] = True

        # Use the environment variable if the API key isn't provided
        api_key = api_key or get_from_env("api_key", "SPIDER_API_KEY")
        self.spider = Spider(api_key=api_key)
        self.url = url
        self.mode = mode
        self.params = params

    def lazy_load(self) -> Iterator[Document]:
        """Lazily load documents based on the specified mode."""
        spider_docs = []

        if self.mode == "scrape":
            # Scrape a single page
            response = self.spider.scrape_url(self.url, params=self.params)
            if response:
                spider_docs.append(response)
        elif self.mode == "crawl":
            # Crawl multiple pages
            response = self.spider.crawl_url(self.url, params=self.params)
            if response:
                spider_docs.extend(response)

        for doc in spider_docs:
            if self.mode == "scrape":
                # scrape_url returns a single-element list wrapping one page payload
                yield Document(
                    page_content=doc[0].get("content", ""),
                    metadata=doc[0].get("metadata", {}),
                )
            if self.mode == "crawl":
                # crawl_url returns one payload dict per crawled page
                page_content = doc.get("content", "")
                metadata = doc.get("metadata", {})
                if page_content is not None:
                    yield Document(
                        page_content=page_content,
                        metadata=metadata,
                    )
```
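The scrape and crawl branches above handle differently shaped API responses: scrape wraps one page payload in a single-element list, while crawl returns one dict per page. A runnable sketch with a stubbed client (the `FakeSpider` class, its canned payloads, and the simplified `load_documents` helper are all invented here for illustration) shows how each shape maps onto documents:

```python
from dataclasses import dataclass, field


# Minimal stand-in for langchain_core's Document, for illustration only
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


class FakeSpider:
    """Fake client returning payloads shaped like the Spider API responses."""

    def scrape_url(self, url, params=None):
        # scrape returns a single-element list wrapping one page payload
        return [{"content": "# Home", "metadata": {"url": url}}]

    def crawl_url(self, url, params=None):
        # crawl returns one payload per page visited
        return [
            {"content": "# Home", "metadata": {"url": url}},
            {"content": "# Docs", "metadata": {"url": url + "/docs"}},
        ]


def load_documents(spider, url, mode="scrape"):
    """Simplified mirror of SpiderLoader.lazy_load's dispatch logic."""
    if mode == "scrape":
        response = spider.scrape_url(url)
        if response:
            # Unwrap the single-element list
            page = response[0]
            yield Document(page.get("content", ""), page.get("metadata", {}))
    elif mode == "crawl":
        for page in spider.crawl_url(url) or []:
            content = page.get("content", "")
            if content is not None:
                yield Document(content, page.get("metadata", {}))


docs = list(load_documents(FakeSpider(), "https://spider.cloud", mode="crawl"))
print([d.page_content for d in docs])  # ['# Home', '# Docs']
```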