v1.0.0-beta.1

Pre-release

@developer-rakeshpaul released this 02 Jan 06:28
· 22 commits to main since this release

scrapex v1.0.0-beta.1

First beta release of scrapex — a modern web scraper with LLM-enhanced extraction, vector embeddings, and comprehensive feed parsing.

Highlights

  • Vector Embeddings — Generate embeddings from scraped content with provider-agnostic design
  • RSS/Atom Feed Parsing — Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds with pagination support
  • Improved LLM provider architecture with shared HTTP infrastructure
  • Comprehensive test coverage

Installation

npm install scrapex@beta

Optional peer dependencies

# For LLM features
npm install openai
npm install @anthropic-ai/sdk

# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node

# For JavaScript-rendered pages
npm install puppeteer

What's New

Vector Embeddings

Generate embeddings from scraped content for semantic search, RAG, and similarity matching:

import { scrape } from "scrapex";
import { createOpenAIEmbedding } from "scrapex/embeddings";

const result = await scrape("https://example.com/article", {
  embeddings: {
    provider: { type: "custom", provider: createOpenAIEmbedding() },
    model: "text-embedding-3-small",
  },
});

if (result.embeddings?.status === "success") {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}
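
For similarity matching, two embedding vectors are commonly compared with cosine similarity. The sketch below is not part of scrapex: `cosineSimilarity` is a hypothetical helper, and it assumes both vectors are plain number arrays of the same length.

```typescript
// Hypothetical helper (not a scrapex API): cosine similarity between two
// embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Values near 1 indicate closely related content, values near 0 indicate unrelated content.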

Provider Support:

| Provider | Factory Function | Notes |
|---|---|---|
| OpenAI | createOpenAIEmbedding() | text-embedding-3-small/large |
| Azure OpenAI | createAzureEmbedding() | Enterprise deployments |
| Cohere | createCohereEmbedding() | embed-english-v3.0 |
| HuggingFace | createHuggingFaceEmbedding() | Inference API |
| Ollama | createOllamaEmbedding() | Local models |
| Transformers.js | createTransformersEmbedding() | Zero-cost local inference |

Features:

  • Optional PII redaction (email, phone, SSN, credit card, IP) before sending to APIs
  • Smart chunking with configurable size/overlap
  • Aggregation modes: average, max, first, all
  • Content-addressable caching
  • Resilience: retry, circuit breaker, rate limiting
  • SSRF protection for custom endpoints
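
To illustrate how chunking with overlap and the "average" aggregation mode fit together, here is a minimal sketch. It is not scrapex's implementation: `chunkText` and `averageVectors` are hypothetical helpers, and the sketch assumes chunk size and overlap are measured in characters, which may differ from what the library does.

```typescript
// Hypothetical helpers (not scrapex APIs) illustrating chunking and averaging.

// Split text into chunks of at most `size` characters; consecutive chunks
// share `overlap` characters.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// "average" aggregation: element-wise mean of the per-chunk vectors.
function averageVectors(vectors: number[][]): number[] {
  if (vectors.length === 0) return [];
  const dim = vectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("Vectors must share a dimension");
    for (let i = 0; i < dim; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / vectors.length);
}
```

Overlap keeps text that straddles a chunk boundary present in both neighbouring chunks, at the cost of embedding some text twice.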

Standalone Functions:

import { embed, embedScrapedData } from "scrapex/embeddings";

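// Embed arbitrary text
// (`provider` and `model` are assumed to be configured as in the scrape() example above)
const result = await embed("Your text here", { provider, model });

// Embed previously scraped data
// (`scrapedData` is the result of an earlier scrape() call)
const embedResult = await embedScrapedData(scrapedData, { provider });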

RSS/Atom Feed Parsing

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:

import { RSSParser, fetchFeed, paginateFeed } from "scrapex";

// Parse feed XML (`feedXml` is a string containing the feed document)
const parser = new RSSParser();
const result = parser.parse(feedXml, "https://example.com/feed.xml");

console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title); // Feed title
console.log(result.data.items); // Array of feed items

// Fetch and parse in one call
const feed = await fetchFeed("https://example.com/feed.xml");

// Paginate through feeds with rel="next" links (Atom)
for await (const page of paginateFeed("https://example.com/atom")) {
  console.log(`Page with ${page.items.length} items`);
}
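
The general idea behind following rel="next" links can be sketched with an async generator. This is an illustration only, not how `paginateFeed` is implemented: `FeedPage`, `FetchPage`, `paginate`, and the `maxPages` limit are hypothetical names, and an in-memory map stands in for network requests.

```typescript
// Hypothetical shapes (not scrapex types): a page of items plus an optional
// URL taken from the feed's <link rel="next"> element.
interface FeedPage {
  items: string[];
  next?: string;
}

type FetchPage = (url: string) => Promise<FeedPage>;

// Follow "next" links until none remain, a URL repeats, or the page limit is hit.
async function* paginate(
  startUrl: string,
  fetchPage: FetchPage,
  maxPages = 10,
): AsyncGenerator<FeedPage> {
  const seen = new Set<string>();
  let url: string | undefined = startUrl;
  while (url !== undefined && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const page: FeedPage = await fetchPage(url);
    yield page;
    url = page.next;
  }
}

// Usage with an in-memory stand-in for network fetches.
const fakePages: Record<string, FeedPage> = {
  "https://example.com/atom": { items: ["a", "b"], next: "https://example.com/atom?page=2" },
  "https://example.com/atom?page=2": { items: ["c"] },
};

async function collectItems(): Promise<string[]> {
  const all: string[] = [];
  const fetchPage: FetchPage = async (url) => fakePages[url] ?? { items: [] };
  for await (const page of paginate("https://example.com/atom", fetchPage)) {
    all.push(...page.items);
  }
  return all;
}
```

Tracking visited URLs and capping the page count guards against feeds whose "next" links loop or never end.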

Feed Utilities:

import {
  discoverFeeds, // Find feed URLs in HTML
  filterByDate, // Filter items by date range
  feedToMarkdown, // Convert feed to markdown (LLM-ready)
  feedToText, // Convert feed to plain text
} from "scrapex";

Custom Fields (Podcasts/Media):

const parser = new RSSParser({
  customFields: {
    duration: "itunes\\:duration",
    explicit: "itunes\\:explicit",
  },
});

Security:

  • HTTPS-only URL resolution in feeds
  • XML mode parsing (prevents XSS vectors)
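
As a rough illustration of HTTPS-only URL resolution, the sketch below resolves a link against the feed URL with the standard `URL` constructor and rejects anything that does not end up as `https:`. `resolveHttpsUrl` is a hypothetical helper, not a scrapex API, and the library's actual checks may differ.

```typescript
// Hypothetical helper (not a scrapex API): resolve a link found in a feed
// against the feed URL and accept it only if the result uses HTTPS.
function resolveHttpsUrl(link: string, baseUrl: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(link, baseUrl); // WHATWG URL resolution
  } catch {
    return null; // unparseable link or base URL
  }
  return resolved.protocol === "https:" ? resolved.href : null;
}
```

Under this scheme, relative links inherit the scheme of the feed URL, while `http:`, `javascript:`, and `data:` links are dropped.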

Breaking Changes

  • LLM providers refactored to use shared HttpLLMProvider base
  • Removed direct OpenAIProvider and AnthropicProvider classes (use factory functions)

Migration Guide

LLM Provider Changes:

// Before (alpha)
import { OpenAIProvider } from "scrapex/llm";
const llm = new OpenAIProvider({ apiKey: "..." });

// After (beta)
import { createOpenAI } from "scrapex/llm";
const llm = createOpenAI({ apiKey: "..." });

Notes

  • This is a beta release; the APIs are largely stable, but minor changes may still occur before v1.0.0.
  • Requires Node.js 20+.

Full Changelog: v1.0.0-alpha.1...v1.0.0-beta.1