v1.0.0-beta.1
Pre-release
scrapex v1.0.0-beta.1
First beta release of scrapex — a modern web scraper with LLM-enhanced extraction, vector embeddings, and comprehensive feed parsing.
Highlights
- Vector Embeddings — Generate embeddings from scraped content with provider-agnostic design
- RSS/Atom Feed Parsing — Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds with pagination support
- Improved LLM provider architecture with shared HTTP infrastructure
- Comprehensive test coverage
Installation
npm install scrapex@beta
Optional peer dependencies
# For LLM features
npm install openai
npm install @anthropic-ai/sdk
# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node
# For JavaScript-rendered pages
npm install puppeteer
What's New
Vector Embeddings
Generate embeddings from scraped content for semantic search, RAG, and similarity matching:
import { scrape } from "scrapex";
import { createOpenAIEmbedding } from "scrapex/embeddings";
const result = await scrape("https://example.com/article", {
  embeddings: {
    provider: { type: "custom", provider: createOpenAIEmbedding() },
    model: "text-embedding-3-small",
  },
});
if (result.embeddings?.status === "success") {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}
Provider Support:
| Provider | Factory Function | Notes |
|---|---|---|
| OpenAI | `createOpenAIEmbedding()` | text-embedding-3-small/large |
| Azure OpenAI | `createAzureEmbedding()` | Enterprise deployments |
| Cohere | `createCohereEmbedding()` | embed-english-v3.0 |
| HuggingFace | `createHuggingFaceEmbedding()` | Inference API |
| Ollama | `createOllamaEmbedding()` | Local models |
| Transformers.js | `createTransformersEmbedding()` | Zero-cost local inference |
Features:
- Optional PII redaction (email, phone, SSN, credit card, IP) before sending to APIs
- Smart chunking with configurable size/overlap
- Aggregation modes: `average`, `max`, `first`, `all`
- Content-addressable caching
- Resilience: retry, circuit breaker, rate limiting
- SSRF protection for custom endpoints
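To make the chunking and aggregation options above more concrete, here is a minimal sketch of how overlapping chunks and per-chunk vectors could be combined. It is not scrapex's implementation: `chunkText` and `aggregate` are hypothetical helpers, character-based windows are an assumption, and the meaning given to each mode (element-wise mean, element-wise maximum, first chunk only, all chunk vectors) is one plausible reading of the mode names.

```typescript
// Illustrative only: these helpers are NOT scrapex APIs.
type Aggregation = "average" | "max" | "first" | "all";

// Split text into fixed-size character windows that overlap by `overlap` characters.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("expected size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap; // how far each window advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// Combine per-chunk vectors according to the assumed meaning of each mode.
function aggregate(vectors: number[][], mode: Aggregation): number[] | number[][] {
  if (vectors.length === 0) throw new Error("no vectors to aggregate");
  if (mode === "all") return vectors; // keep one vector per chunk
  if (mode === "first") return vectors[0]; // only the first chunk's vector
  const dims = vectors[0].length;
  const out = new Array<number>(dims).fill(mode === "max" ? -Infinity : 0);
  for (const v of vectors) {
    for (let i = 0; i < dims; i++) {
      out[i] = mode === "max"
        ? Math.max(out[i], v[i]) // element-wise maximum
        : out[i] + v[i] / vectors.length; // running element-wise mean
    }
  }
  return out;
}
```

A larger overlap keeps more context across chunk boundaries at the cost of embedding more text overall, which is why both values are worth tuning together.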
Standalone Functions:
import { embed, embedScrapedData } from "scrapex/embeddings";
// Embed arbitrary text
const result = await embed("Your text here", { provider, model });
// Embed previously scraped data
const embedResult = await embedScrapedData(scrapedData, { provider });
RSS/Atom Feed Parsing
Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:
import { RSSParser, fetchFeed, paginateFeed } from "scrapex";
// Parse feed XML
const parser = new RSSParser();
const result = parser.parse(feedXml, "https://example.com/feed.xml");
console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title); // Feed title
console.log(result.data.items); // Array of feed items
// Fetch and parse in one call
const feed = await fetchFeed("https://example.com/feed.xml");
// Paginate through feeds with rel="next" links (Atom)
for await (const page of paginateFeed("https://example.com/atom")) {
  console.log(`Page with ${page.items.length} items`);
}
Feed Utilities:
import {
  discoverFeeds, // Find feed URLs in HTML
  filterByDate, // Filter items by date range
  feedToMarkdown, // Convert feed to markdown (LLM-ready)
  feedToText, // Convert feed to plain text
} from "scrapex";
Custom Fields (Podcasts/Media):
const parser = new RSSParser({
  customFields: {
    duration: "itunes\\:duration",
    explicit: "itunes\\:explicit",
  },
});
Security:
- HTTPS-only URL resolution in feeds
- XML mode parsing (prevents XSS vectors)
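As a rough illustration of what HTTPS-only URL resolution can mean in practice, the sketch below resolves a link found in a feed against the feed's base URL and discards anything that does not end up on the `https:` scheme. The helper name `resolveHttpsUrl` is hypothetical and the logic is an assumption about the general technique, not scrapex's actual code.

```typescript
// Hypothetical helper, not part of scrapex: resolve a feed link against the
// feed URL and keep it only when the result uses HTTPS.
function resolveHttpsUrl(link: string, baseUrl: string): string | null {
  try {
    // The WHATWG URL constructor handles both relative and absolute links.
    const resolved = new URL(link, baseUrl);
    return resolved.protocol === "https:" ? resolved.href : null;
  } catch {
    return null; // malformed link or base URL
  }
}

const base = "https://example.com/feed.xml";
console.log(resolveHttpsUrl("/posts/1", base)); // relative link, resolved against the feed URL
console.log(resolveHttpsUrl("http://example.com/a", base)); // plain HTTP is rejected
console.log(resolveHttpsUrl("javascript:alert(1)", base)); // non-HTTP schemes are rejected
```

Checking the scheme after resolution, rather than inspecting the raw string, means relative links inherit the feed's own scheme and schemes such as `javascript:` or `data:` never reach the caller.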
Breaking Changes
- LLM providers refactored to use a shared `HttpLLMProvider` base
- Removed direct `OpenAIProvider` and `AnthropicProvider` classes (use factory functions)
Migration Guide
LLM Provider Changes:
// Before (alpha)
import { OpenAIProvider } from "scrapex/llm";
const llm = new OpenAIProvider({ apiKey: "..." });
// After (beta)
import { createOpenAI } from "scrapex/llm";
const llm = createOpenAI({ apiKey: "..." });
Notes
- This is a beta release; APIs are stable but minor changes may occur before `v1.0.0`.
- Requires Node.js 20+.
Full Changelog: v1.0.0-alpha.1...v1.0.0-beta.1