scrapex v1.0.0-beta.1

First beta release of scrapex — a modern web scraper with LLM-enhanced extraction, vector embeddings, and comprehensive feed parsing.

Highlights

Vector Embeddings — Generate embeddings from scraped content with provider-agnostic design
RSS/Atom Feed Parsing — Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds with pagination support
Improved LLM provider architecture with shared HTTP infrastructure
Comprehensive test coverage

Installation

npm install scrapex@beta

Optional peer dependencies

# For LLM features
npm install openai
npm install @anthropic-ai/sdk

# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node

# For JavaScript-rendered pages
npm install puppeteer

What's New

Vector Embeddings

Generate embeddings from scraped content for semantic search, RAG, and similarity matching:

import { scrape } from "scrapex";
import { createOpenAIEmbedding } from "scrapex/embeddings";

const result = await scrape("https://example.com/article", {
  embeddings: {
    provider: { type: "custom", provider: createOpenAIEmbedding() },
    model: "text-embedding-3-small",
  },
});

if (result.embeddings?.status === "success") {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}

Provider Support:

Provider	Factory Function	Notes
OpenAI	`createOpenAIEmbedding()`	text-embedding-3-small/large
Azure OpenAI	`createAzureEmbedding()`	Enterprise deployments
Cohere	`createCohereEmbedding()`	embed-english-v3.0
HuggingFace	`createHuggingFaceEmbedding()`	Inference API
Ollama	`createOllamaEmbedding()`	Local models
Transformers.js	`createTransformersEmbedding()`	Zero-cost local inference

Features:

Optional PII redaction (email, phone, SSN, credit card, IP) before sending to APIs
Smart chunking with configurable size/overlap
Aggregation modes: average, max, first, all
Content-addressable caching
Resilience: retry, circuit breaker, rate limiting
SSRF protection for custom endpoints
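
To make the chunking and aggregation options above concrete, here is a minimal sketch of how overlapping chunks and the average, max, and first modes could work. The helper names `chunkText` and `aggregate` are hypothetical and are not part of scrapex's API; the library's own implementation may differ (for example, it may chunk by tokens rather than characters). The `all` mode is omitted, on the assumption that it simply keeps every per-chunk vector.

```typescript
// Illustrative only: these helpers are hypothetical and are not scrapex's API.

/** Split text into windows of `size` characters, each overlapping the previous by `overlap`. */
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap; // how far each window advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

type Aggregation = "average" | "max" | "first";

/** Combine per-chunk vectors of equal length into a single vector. */
function aggregate(vectors: number[][], mode: Aggregation): number[] {
  if (vectors.length === 0) throw new Error("no vectors to aggregate");
  const first = vectors[0];
  if (mode === "first") return [...first];
  return first.map((_, i) => {
    const column = vectors.map((v) => v[i]); // i-th component of every chunk vector
    return mode === "max"
      ? Math.max(...column)
      : column.reduce((sum, x) => sum + x, 0) / column.length;
  });
}
```

A larger overlap keeps more context across chunk boundaries at the cost of more embedding calls; averaging is the usual default when a single vector per document is needed.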

Standalone Functions:

import { embed, embedScrapedData } from "scrapex/embeddings";

// Embed arbitrary text
const result = await embed("Your text here", { provider, model });

// Embed previously scraped data
const embedResult = await embedScrapedData(scrapedData, { provider });

RSS/Atom Feed Parsing

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:

import { RSSParser, fetchFeed, paginateFeed } from "scrapex";

// Parse feed XML
const parser = new RSSParser();
const result = parser.parse(feedXml, "https://example.com/feed.xml");

console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title); // Feed title
console.log(result.data.items); // Array of feed items

// Fetch and parse in one call
const feed = await fetchFeed("https://example.com/feed.xml");

// Paginate through feeds with rel="next" links (Atom)
for await (const page of paginateFeed("https://example.com/atom")) {
  console.log(`Page with ${page.items.length} items`);
}
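
For readers curious how rel="next" pagination works in general, the following is a rough sketch of the pattern: load a page, yield it, then follow its next link until none remains. It is not scrapex's implementation. The `FeedPage` shape, the `followNext` name, the injected `loadPage` function, and the `maxPages` guard are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; scrapex's own page type may differ.
interface FeedPage {
  items: string[];
  next?: string; // URL taken from <link rel="next">, if the feed provides one
}

type LoadPage = (url: string) => Promise<FeedPage>;

/** Yield pages by following `next` links; stop on a missing link, a repeated URL, or maxPages. */
async function* followNext(
  startUrl: string,
  loadPage: LoadPage,
  maxPages = 10,
): AsyncGenerator<FeedPage> {
  const seen = new Set<string>(); // URLs already visited, to break cycles
  let url: string | undefined = startUrl;
  while (url !== undefined && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const page: FeedPage = await loadPage(url);
    yield page;
    url = page.next;
  }
}
```

Tracking visited URLs and capping the page count guards against feeds whose next links loop back on themselves.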

Feed Utilities:

import {
  discoverFeeds, // Find feed URLs in HTML
  filterByDate, // Filter items by date range
  feedToMarkdown, // Convert feed to markdown (LLM-ready)
  feedToText, // Convert feed to plain text
} from "scrapex";
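
The exact signatures of these utilities are not shown in these notes. As a rough illustration of what date-range filtering involves, here is a standalone sketch; the `DatedItem` shape, the `publishedAt` field, and the `itemsInRange` helper are hypothetical and should not be read as scrapex's `filterByDate` API.

```typescript
// Hypothetical item shape; scrapex's feed items may use different field names.
interface DatedItem {
  title: string;
  publishedAt?: string; // a date string that Date.parse can read, e.g. ISO 8601
}

/** Keep items dated within [after, before]; undated or unparseable items are dropped. */
function itemsInRange<T extends DatedItem>(items: T[], after?: Date, before?: Date): T[] {
  return items.filter((item) => {
    if (!item.publishedAt) return false;
    const time = Date.parse(item.publishedAt);
    if (Number.isNaN(time)) return false; // unparseable date
    if (after && time < after.getTime()) return false;
    if (before && time > before.getTime()) return false;
    return true;
  });
}
```

Whether undated items should be dropped or kept is a design choice; this sketch drops them so that the result only contains items known to be in range.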

Custom Fields (Podcasts/Media):

const parser = new RSSParser({
  customFields: {
    duration: "itunes\\:duration",
    explicit: "itunes\\:explicit",
  },
});

Security:

HTTPS-only URL resolution in feeds
XML mode parsing (prevents XSS vectors)
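
One plausible reading of HTTPS-only URL resolution is sketched below using the standard WHATWG `URL` class: relative links are resolved against the feed URL, and anything that does not end up as an `https:` URL is discarded. The `resolveHttpsOnly` helper is hypothetical, and scrapex's actual rules may be stricter or more lenient.

```typescript
/** Resolve `href` against the feed URL; return the absolute URL only if it uses https. */
function resolveHttpsOnly(href: string, baseUrl: string): string | undefined {
  let resolved: URL;
  try {
    resolved = new URL(href, baseUrl);
  } catch {
    return undefined; // malformed href or base URL
  }
  return resolved.protocol === "https:" ? resolved.href : undefined;
}
```

Checking the protocol after resolution, rather than inspecting the raw string, means schemes such as `javascript:` or `http:` are rejected however the link is written in the feed.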

Breaking Changes

LLM providers refactored to use shared HttpLLMProvider base
Removed direct OpenAIProvider and AnthropicProvider classes (use factory functions)

Migration Guide

LLM Provider Changes:

// Before (alpha)
import { OpenAIProvider } from "scrapex/llm";
const llm = new OpenAIProvider({ apiKey: "..." });

// After (beta)
import { createOpenAI } from "scrapex/llm";
const llm = createOpenAI({ apiKey: "..." });

Notes

This is a beta release: the APIs are largely stable, but minor changes may still occur before v1.0.0.
Requires Node.js 20+.

Full Changelog: v1.0.0-alpha.1...v1.0.0-beta.1
