scrapex v1.0.0

The first stable release of scrapex — a modern web scraper with LLM-enhanced extraction, vector embeddings, content normalization, and comprehensive feed parsing.

Installation

npm install scrapex

Optional Peer Dependencies

# For LLM features
npm install openai              # OpenAI/Ollama/LM Studio
npm install @anthropic-ai/sdk   # Anthropic Claude

# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node

# For JavaScript-rendered pages
npm install puppeteer

Highlights

LLM-Ready Output — Content extracted as Markdown, optimized for AI consumption
Vector Embeddings — Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
Content Normalization — Clean, boilerplate-free text for embeddings and RAG pipelines
RSS/Atom Feeds — Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with Media RSS support
Provider-Agnostic LLM — Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
Extensible Pipeline — Priority-based pluggable extractors
TypeScript First — Full type safety with comprehensive exports
Dual Format — ESM and CommonJS builds

Core Features

Web Scraping

import { scrape, scrapeHtml } from 'scrapex';

// Fetch and extract
const result = await scrape('https://example.com/article', {
  timeout: 10000,
  extractContent: true,
  respectRobots: true,
});

console.log(result.title);       // Page title
console.log(result.content);     // Markdown content
console.log(result.textContent); // Plain text
console.log(result.excerpt);     // ~300 char preview

// Extract from raw HTML you already have
// (fetchSomehow is a placeholder for your own fetching logic)
const html = await fetchSomehow('https://example.com');
const htmlResult = await scrapeHtml(html, 'https://example.com');

Content Normalization

Clean, embedding-ready text with boilerplate removal and block classification:

const result = await scrape(url, {
  normalize: {
    mode: 'full',            // or 'summary' for score-ranked blocks
    removeBoilerplate: true, // filter nav, footer, promos
    maxChars: 5000,          // truncate at sentence boundary
  },
});

console.log(result.normalizedText);    // Clean text ready for embedding
console.log(result.normalizationMeta); // charCount, tokenEstimate, hash
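
The `maxChars` option above is described as truncating at a sentence boundary. The following is a minimal sketch of that idea, assuming a simple punctuation-based rule. `truncateAtSentence` is a hypothetical helper written for illustration, not a scrapex export, and the library's actual boundary detection may differ.

```typescript
// Hypothetical helper: cut text to at most maxChars, preferring the last
// sentence-ending punctuation mark that still fits inside the limit.
function truncateAtSentence(
  text: string,
  maxChars: number,
): { text: string; truncated: boolean } {
  if (text.length <= maxChars) {
    return { text, truncated: false };
  }
  const head = text.slice(0, maxChars);
  // Position of the last ".", "!" or "?" inside the allowed prefix.
  const lastEnd = Math.max(
    head.lastIndexOf("."),
    head.lastIndexOf("!"),
    head.lastIndexOf("?"),
  );
  // Fall back to a hard cut when no sentence end fits in the prefix.
  const cut = lastEnd >= 0 ? head.slice(0, lastEnd + 1) : head;
  return { text: cut.trimEnd(), truncated: true };
}

const sampleText = "First sentence. Second sentence is longer. Third.";
const shortened = truncateAtSentence(sampleText, 30);
console.log(shortened.text); // "First sentence."
```

A hard character cut would end mid-word ("...Second sentenc"), which tends to produce noisier embeddings than a clean sentence.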

Standalone Normalization

import { parseBlocks, normalizeText, defaultBlockClassifier, combineClassifiers } from 'scrapex';
import { load } from 'cheerio';

const $ = load(html);
const blocks = parseBlocks($);
const result = await normalizeText(blocks, { mode: 'full' });

Custom Classifiers

const myClassifier = combineClassifiers(
  defaultBlockClassifier,
  (block) => {
    if (block.text.includes('About the Author')) {
      return { accept: false, label: 'author-bio' };
    }
    return { accept: true };
  }
);

const result = await scrape(url, {
  normalize: { blockClassifier: myClassifier },
});
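
To make the composition above easier to reason about, here is a self-contained sketch of how a classifier combinator could work. The types and the "first rejection wins" rule are assumptions made for this example. The real `combineClassifiers` in scrapex may use different types or merge rules.

```typescript
// Local stand-in types for illustration; the real scrapex types may differ.
interface BlockSketch {
  type: string;
  text: string;
}
interface Verdict {
  accept: boolean;
  label?: string;
}
type Classifier = (block: BlockSketch) => Verdict;

// Assumed semantics: run classifiers in order and stop at the first rejection.
function combineSketch(...classifiers: Classifier[]): Classifier {
  return (block) => {
    for (const classify of classifiers) {
      const verdict = classify(block);
      if (!verdict.accept) return verdict;
    }
    return { accept: true };
  };
}

const dropNav: Classifier = (b) =>
  b.type === "nav" ? { accept: false, label: "nav" } : { accept: true };

const dropAuthorBio: Classifier = (b) =>
  b.text.includes("About the Author")
    ? { accept: false, label: "author-bio" }
    : { accept: true };

const combinedClassifier = combineSketch(dropNav, dropAuthorBio);
console.log(combinedClassifier({ type: "paragraph", text: "About the Author: Jane" }));
```

Under this rule, classifier order only affects which label is reported when several classifiers would reject the same block.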

Vector Embeddings

Generate embeddings from scraped content for semantic search, RAG, and similarity matching:

import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';

const result = await scrape('https://example.com/article', {
  embeddings: {
    provider: { type: 'custom', provider: createOpenAIEmbedding() },
    model: 'text-embedding-3-small',
  },
});

if (result.embeddings?.status === 'success') {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}
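
Once you have vectors, semantic search usually comes down to comparing them. The sketch below shows cosine similarity and a simple ranking over in-memory documents. The function names and the tiny two-dimensional vectors are illustrative only and are not part of scrapex.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Treat zero vectors as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface StoredDoc {
  id: string;
  vector: number[];
}

// Return document ids ordered from most to least similar to the query.
function rankBySimilarity(queryVector: number[], stored: StoredDoc[]): string[] {
  return stored
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((p, q) => q.score - p.score)
    .map((entry) => entry.id);
}

const storedDocs: StoredDoc[] = [
  { id: "a", vector: [1, 0] },
  { id: "b", vector: [0.6, 0.8] },
  { id: "c", vector: [0, 1] },
];
console.log(rankBySimilarity([0, 1], storedDocs)); // [ 'c', 'b', 'a' ]
```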

Embedding Providers

| Provider | Factory Function | Notes |
| --- | --- | --- |
| OpenAI | `createOpenAIEmbedding()` | text-embedding-3-small/large |
| Azure OpenAI | `createAzureEmbedding()` | Enterprise deployments |
| Cohere | `createCohereEmbedding()` | embed-english-v3.0 |
| HuggingFace | `createHuggingFaceEmbedding()` | Inference API |
| Ollama | `createOllamaEmbedding()` | Local models |
| Transformers.js | `createTransformersEmbedding()` | Zero-cost local inference |

Embedding Features

Optional PII redaction (email, phone, SSN, credit card, IP)
Smart chunking with configurable size/overlap
Aggregation modes: average, max, first, all
Content-addressable caching
Resilience: retry, circuit breaker, rate limiting
SSRF protection for custom endpoints
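
The chunking and aggregation items in the list above can be illustrated with a short sketch. It assumes character-based chunks and shows only the `average` mode. The helper names are hypothetical, and scrapex may chunk by tokens or sentences instead.

```typescript
// Split text into fixed-size chunks where consecutive chunks share
// `overlap` characters.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once the current chunk reaches the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

// Element-wise mean of per-chunk vectors ("average" aggregation).
function averageVectors(vectors: number[][]): number[] {
  const first = vectors[0];
  if (!first) return [];
  const sums = new Array<number>(first.length).fill(0);
  for (const vector of vectors) {
    for (let i = 0; i < sums.length; i++) {
      sums[i] = (sums[i] ?? 0) + (vector[i] ?? 0);
    }
  }
  return sums.map((total) => total / vectors.length);
}

console.log(chunkText("abcdefghij", 4, 2)); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
console.log(averageVectors([[1, 2], [3, 4]])); // [ 2, 3 ]
```

Overlap keeps a sentence that straddles a chunk boundary fully present in at least one chunk, at the cost of embedding some text twice.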

Standalone Embedding Functions

import { embed, embedScrapedData } from 'scrapex/embeddings';

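// Embed arbitrary text (`provider` and `model` come from your own embedding configuration)
const result = await embed('Your text here', { provider, model });

// Embed previously scraped data (`scrapedData` is the result of an earlier scrape() call)
const embedResult = await embedScrapedData(scrapedData, { provider });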

RSS/Atom Feed Parsing

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:

import { RSSParser, fetchFeed, paginateFeed, normalizeFeedItem } from 'scrapex';

// Parse feed XML
const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');

console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title);  // Feed title
console.log(result.data.items);  // Array of feed items

// Fetch and parse in one call
const feed = await fetchFeed('https://example.com/feed.xml');

// Paginate through feeds with rel="next" links (RFC 5005)
for await (const page of paginateFeed('https://example.com/atom')) {
  console.log(`Page with ${page.items.length} items`);
}

// Normalize feed items for embeddings
for (const item of feed.data.items) {
  const normalized = await normalizeFeedItem(item, { mode: 'full' });
  console.log(normalized.text); // Clean text from content/description
}
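
For readers curious what following `rel="next"` links involves, the sketch below walks a chain of pages with a guard against loops and runaway pagination. It uses an in-memory page table instead of network requests. `followNextLinks`, `FeedPage`, and the `maxPages` default are assumptions for this example and do not describe how `paginateFeed` is implemented.

```typescript
// Minimal page shape for this sketch, mirroring the `items` and `next` fields.
interface FeedPage {
  items: string[];
  next?: string;
}

// Follow `next` links until they run out, a URL repeats, or maxPages is hit.
async function* followNextLinks(
  startUrl: string,
  loadPage: (url: string) => Promise<FeedPage>,
  maxPages = 10,
): AsyncGenerator<FeedPage> {
  const seen = new Set<string>();
  let url: string | undefined = startUrl;
  while (url !== undefined && !seen.has(url) && seen.size < maxPages) {
    seen.add(url);
    const page: FeedPage = await loadPage(url);
    yield page;
    url = page.next;
  }
}

// In-memory stand-in for network fetches. Page 2 links back to page 1
// to demonstrate the loop guard.
const fakePages: Record<string, FeedPage> = {
  "https://example.com/atom": { items: ["a", "b"], next: "https://example.com/atom?page=2" },
  "https://example.com/atom?page=2": { items: ["c"], next: "https://example.com/atom" },
};

async function collectItems(): Promise<string[]> {
  const collected: string[] = [];
  const loader = async (url: string): Promise<FeedPage> => {
    const page = fakePages[url];
    if (!page) throw new Error(`Unknown page: ${url}`);
    return page;
  };
  for await (const page of followNextLinks("https://example.com/atom", loader)) {
    collected.push(...page.items);
  }
  return collected;
}

collectItems().then((items) => console.log(items)); // [ 'a', 'b', 'c' ]
```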

Feed Utilities

import {
  discoverFeeds,   // Find feed URLs in HTML
  filterByDate,    // Filter items by date range
  feedToMarkdown,  // Convert feed to markdown (LLM-ready)
  feedToText,      // Convert feed to plain text
} from 'scrapex';

Media RSS & Custom Fields

const parser = new RSSParser({
  customFields: {
    // iTunes podcast fields
    duration: 'itunes\\:duration',
    explicit: 'itunes\\:explicit',
    // Media RSS with attribute extraction
    thumbnail: 'media\\:thumbnail@url',
    contentUrl: 'media\\:content@url',
  },
});

const result = parser.parse(podcastFeed);
console.log(result.data.items[0]?.customFields?.thumbnail);
// => "https://example.com/images/thumbnail.jpg"
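
The selector strings above appear to combine an escaped namespaced element name with an optional `@attribute` suffix. As a rough illustration of that reading, the sketch below splits such a string into its parts. `parseFieldSelector` is hypothetical and is not taken from the scrapex source.

```typescript
// Hypothetical parsed form of a custom field selector.
interface FieldSelector {
  element: string;
  attribute?: string;
}

// Split "prefix\:name@attr" into an element name and an optional attribute.
function parseFieldSelector(selector: string): FieldSelector {
  const at = selector.lastIndexOf("@");
  const rawElement = at === -1 ? selector : selector.slice(0, at);
  // Undo the CSS-style escaping of the namespace colon.
  const element = rawElement.replace(/\\:/g, ":");
  return at === -1
    ? { element }
    : { element, attribute: selector.slice(at + 1) };
}

console.log(parseFieldSelector("itunes\\:duration"));
// { element: 'itunes:duration' }
console.log(parseFieldSelector("media\\:thumbnail@url"));
// { element: 'media:thumbnail', attribute: 'url' }
```

The double backslash in the source is a JavaScript string escape, so the parser receives a single backslash before the colon.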

LLM Integration

Provider-agnostic LLM support for content enhancement:

import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';

const llm = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });

const result = await scrape(url, {
  llm,
  enhance: ['summarize', 'tags', 'entities', 'classify'],
});

console.log(result.summary);    // AI-generated summary
console.log(result.tags);       // Extracted keywords
console.log(result.entities);   // Named entities
console.log(result.category);   // Content classification

Structured Extraction

const result = await scrape(url, {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
  },
});

console.log(result.extracted); // { productName, price, features, inStock }
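
LLM output can drift from the requested shape, so it may be worth checking `result.extracted` before using it. The sketch below validates an object against the same kind of type map used in the `extract` option. `findInvalidFields` and the supporting types are hypothetical helpers, not part of scrapex.

```typescript
// Type names as used in the extract schema above.
type FieldType = "string" | "number" | "boolean" | "string[]";
type ExtractSchema = Record<string, FieldType>;

function matchesType(value: unknown, type: FieldType): boolean {
  if (type === "string[]") {
    return Array.isArray(value) && value.every((v) => typeof v === "string");
  }
  return typeof value === type;
}

// Return the names of fields that are missing or have the wrong type.
function findInvalidFields(
  data: Record<string, unknown>,
  schema: ExtractSchema,
): string[] {
  return Object.entries(schema)
    .filter(([key, type]) => !matchesType(data[key], type))
    .map(([key]) => key);
}

const productSchema: ExtractSchema = {
  productName: "string",
  price: "number",
  features: "string[]",
  inStock: "boolean",
};

// Example payload where the model returned the price as a string.
const extractedSample = {
  productName: "Widget",
  price: "19.99",
  features: ["small", "blue"],
  inStock: true,
};

console.log(findInvalidFields(extractedSample, productSchema)); // [ 'price' ]
```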

Built-in Extractors

| Extractor | Priority | Purpose |
| --- | --- | --- |
| `MetaExtractor` | 100 | Open Graph, Twitter Cards, meta |
| `JsonLdExtractor` | 80 | JSON-LD structured data |
| `FaviconExtractor` | 70 | Favicon discovery |
| `ContentExtractor` | 50 | Readability + Turndown → Markdown |
| `LinksExtractor` | 30 | Link extraction |
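
To show what a priority-based pipeline can look like in practice, here is a self-contained sketch. It assumes that higher priority numbers run first and that the first extractor to set a field wins. Both rules, along with the `ExtractorSketch` interface, are assumptions for illustration. The actual scrapex extractor interface and merge behavior may differ.

```typescript
// Hypothetical extractor shape for this sketch.
interface ExtractorSketch {
  label: string;
  priority: number;
  extract(html: string): Record<string, unknown>;
}

function runPipeline(
  html: string,
  extractors: ExtractorSketch[],
): Record<string, unknown> {
  // Assumed ordering: highest priority first.
  const ordered = [...extractors].sort((a, b) => b.priority - a.priority);
  const merged: Record<string, unknown> = {};
  for (const extractor of ordered) {
    for (const [key, value] of Object.entries(extractor.extract(html))) {
      // Assumed merge rule: first writer wins, so earlier extractors keep their values.
      if (!(key in merged)) merged[key] = value;
    }
  }
  return merged;
}

const titleFromTag: ExtractorSketch = {
  label: "title-tag",
  priority: 100,
  extract: (html) => {
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    return match?.[1] ? { title: match[1] } : {};
  },
};

const titleFromHeading: ExtractorSketch = {
  label: "first-heading",
  priority: 50,
  extract: (html) => {
    const match = /<h1>([^<]*)<\/h1>/i.exec(html);
    return match?.[1] ? { title: match[1], heading: match[1] } : {};
  },
};

const pipelineResult = runPipeline(
  "<title>Page</title><h1>Heading</h1>",
  [titleFromHeading, titleFromTag],
);
console.log(pipelineResult); // { title: 'Page', heading: 'Heading' }
```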

URL Utilities

import {
  normalizeUrl,
  resolveUrl,
  extractDomain,
  isExternalUrl,
  isValidUrl,
  matchesUrlPattern,
  getProtocol,
  getPath,
} from 'scrapex';
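
The release notes list these helpers without signatures. As a rough guide to the kind of work they do, the sketch below implements three similar operations on top of the standard WHATWG `URL` class. The `*Sketch` functions are stand-ins written for this example. The scrapex functions may take different arguments or apply extra normalization.

```typescript
// Resolve a possibly relative link against the page it was found on.
function resolveSketch(href: string, base: string): string {
  return new URL(href, base).href;
}

// Return the hostname portion of an absolute URL.
function domainSketch(url: string): string {
  return new URL(url).hostname;
}

// A link is treated as external when its hostname differs from the page's.
function isExternalSketch(href: string, pageUrl: string): boolean {
  return new URL(href, pageUrl).hostname !== new URL(pageUrl).hostname;
}

console.log(resolveSketch("../about", "https://example.com/blog/post"));
// "https://example.com/about"
console.log(domainSketch("https://sub.example.com/x")); // "sub.example.com"
console.log(isExternalSketch("https://other.org/", "https://example.com/")); // true
```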

Types

Content Types

type BlockType =
  | 'paragraph' | 'heading' | 'list' | 'quote'
  | 'table' | 'code' | 'media' | 'nav'
  | 'footer' | 'promo' | 'legal' | 'unknown';

interface ContentBlock {
  type: BlockType;
  text: string;
  level?: 1 | 2 | 3 | 4 | 5 | 6;
  html?: string;
  attrs?: Record<string, string>;
}

interface NormalizationMeta {
  charCount: number;
  tokenEstimate: number;
  language: string;
  hash: string;
  blocksTotal: number;
  blocksAccepted: number;
  truncated: boolean;
}
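
The `hash` field lends itself to deduplication. The following sketch shows one way to compute a SHA-256 content hash with Node's built-in `crypto` module and use it to drop duplicates. The whitespace canonicalization step is an assumption made here and may not match how scrapex computes its hash.

```typescript
import { createHash } from "node:crypto";

// Hash text with SHA-256 after collapsing whitespace, so that
// formatting-only differences do not change the result (an assumption).
function contentHash(text: string): string {
  const canonical = text.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// Keep the first occurrence of each distinct hash.
function dedupeByHash(texts: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const text of texts) {
    const hash = contentHash(text);
    if (!seen.has(hash)) {
      seen.add(hash);
      unique.push(text);
    }
  }
  return unique;
}

console.log(dedupeByHash(["Hello world", "Hello   world", "Other"]));
// [ 'Hello world', 'Other' ]
```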

Feed Types

interface FeedItem {
  id: string;
  title: string;
  link: string;
  description?: string;
  content?: string;
  author?: string;
  publishedAt?: string;      // ISO 8601
  updatedAt?: string;
  categories: string[];
  enclosure?: FeedEnclosure;
  customFields?: Record<string, string>;
}

interface ParsedFeed {
  format: 'rss2' | 'rss1' | 'atom';
  title: string;
  description?: string;
  link: string;
  next?: string;             // RFC 5005 pagination
  items: FeedItem[];
}
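
Because `publishedAt` is an optional ISO 8601 string, code that consumes feed items needs to handle missing or unparseable dates. The sketch below filters items to a date range and sorts them newest first. `itemsBetween` and `DatedItem` are illustrative and separate from the library's own `filterByDate` utility, whose signature is not shown in these notes.

```typescript
// Subset of FeedItem fields needed for this example.
interface DatedItem {
  id: string;
  title: string;
  publishedAt?: string; // ISO 8601
}

// Keep items published within [from, to], newest first.
// Items without a parseable date are dropped.
function itemsBetween(items: DatedItem[], from: Date, to: Date): DatedItem[] {
  return items
    .filter((item) => {
      if (!item.publishedAt) return false;
      const time = Date.parse(item.publishedAt);
      return !Number.isNaN(time) && time >= from.getTime() && time <= to.getTime();
    })
    .sort(
      (a, b) => Date.parse(b.publishedAt ?? "") - Date.parse(a.publishedAt ?? ""),
    );
}

const sampleItems: DatedItem[] = [
  { id: "a", title: "January post", publishedAt: "2024-01-10T00:00:00Z" },
  { id: "b", title: "February post", publishedAt: "2024-02-15T00:00:00Z" },
  { id: "c", title: "Undated post" },
  { id: "d", title: "March post", publishedAt: "2024-03-01T00:00:00Z" },
];

const inRange = itemsBetween(
  sampleItems,
  new Date("2024-01-01T00:00:00Z"),
  new Date("2024-02-28T23:59:59Z"),
);
console.log(inRange.map((item) => item.id)); // [ 'b', 'a' ]
```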

Security

HTTPS-only URL resolution in feeds
XML mode parsing (prevents XSS vectors)
ReDoS prevention via input limiting
maxBlocks enforcement (default 2000)
PII redaction before embedding API calls
SSRF protection for custom endpoints
SHA-256 content hashing for deduplication
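
As an illustration of the PII redaction item, the sketch below masks email addresses and IPv4 addresses before text would be sent to an embedding API. The patterns are deliberately simple and the placeholder tokens are made up for this example. They are not the rules scrapex uses, and production redaction needs broader coverage.

```typescript
// Simplified patterns for illustration only.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const IPV4_PATTERN = /\b(?:\d{1,3}\.){3}\d{1,3}\b/g;

// Replace matches with placeholder tokens so the original values never leave the process.
function redactPii(text: string): string {
  return text.replace(EMAIL_PATTERN, "[EMAIL]").replace(IPV4_PATTERN, "[IP]");
}

console.log(redactPii("Contact jane.doe@example.com from 192.168.0.1."));
// "Contact [EMAIL] from [IP]."
```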

Requirements

Node.js 20+

Migration from Beta

No breaking changes from v1.0.0-beta.5. Update your installation:

npm install scrapex@latest

Acknowledgments

Built with:

Mozilla Readability for content extraction
Cheerio for HTML parsing
Turndown for Markdown conversion

Full Changelog: v1.0.0-beta.5...v1.0.0
