scrapex v1.0.0
The first stable release of scrapex — a modern web scraper with LLM-enhanced extraction, vector embeddings, content normalization, and comprehensive feed parsing.
Installation
npm install scrapex
Optional Peer Dependencies
# For LLM features
npm install openai # OpenAI/Ollama/LM Studio
npm install @anthropic-ai/sdk # Anthropic Claude
# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node
# For JavaScript-rendered pages
npm install puppeteer
Highlights
- LLM-Ready Output — Content extracted as Markdown, optimized for AI consumption
- Vector Embeddings — Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
- Content Normalization — Clean, boilerplate-free text for embeddings and RAG pipelines
- RSS/Atom Feeds — Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with Media RSS support
- Provider-Agnostic LLM — Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
- Extensible Pipeline — Priority-based pluggable extractors
- TypeScript First — Full type safety with comprehensive exports
- Dual Format — ESM and CommonJS builds
Core Features
Web Scraping
import { scrape, scrapeHtml } from 'scrapex';
// Fetch and extract
const result = await scrape('https://example.com/article', {
timeout: 10000,
extractContent: true,
respectRobots: true,
});
console.log(result.title); // Page title
console.log(result.content); // Markdown content
console.log(result.textContent); // Plain text
console.log(result.excerpt); // ~300 char preview
// Extract from raw HTML (fetchSomehow stands in for your own fetch logic)
const html = await fetchSomehow('https://example.com');
const htmlResult = await scrapeHtml(html, 'https://example.com');
Content Normalization
Clean, embedding-ready text with boilerplate removal and block classification:
const result = await scrape(url, {
normalize: {
mode: 'full', // or 'summary' for score-ranked blocks
removeBoilerplate: true, // filter nav, footer, promos
maxChars: 5000, // truncate at sentence boundary
},
});
console.log(result.normalizedText); // Clean text ready for embedding
console.log(result.normalizationMeta); // charCount, tokenEstimate, hash
Standalone Normalization
import { parseBlocks, normalizeText, defaultBlockClassifier, combineClassifiers } from 'scrapex';
import { load } from 'cheerio';
const $ = load(html);
const blocks = parseBlocks($);
const result = await normalizeText(blocks, { mode: 'full' });
Custom Classifiers
const myClassifier = combineClassifiers(
defaultBlockClassifier,
(block) => {
if (block.text.includes('About the Author')) {
return { accept: false, label: 'author-bio' };
}
return { accept: true };
}
);
const result = await scrape(url, {
normalize: { blockClassifier: myClassifier },
});
Vector Embeddings
Generate embeddings from scraped content for semantic search, RAG, and similarity matching:
import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';
const result = await scrape('https://example.com/article', {
embeddings: {
provider: { type: 'custom', provider: createOpenAIEmbedding() },
model: 'text-embedding-3-small',
},
});
if (result.embeddings?.status === 'success') {
console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}
Embedding Providers
| Provider | Factory Function | Notes |
|---|---|---|
| OpenAI | createOpenAIEmbedding() | text-embedding-3-small/large |
| Azure OpenAI | createAzureEmbedding() | Enterprise deployments |
| Cohere | createCohereEmbedding() | embed-english-v3.0 |
| HuggingFace | createHuggingFaceEmbedding() | Inference API |
| Ollama | createOllamaEmbedding() | Local models |
| Transformers.js | createTransformersEmbedding() | Zero-cost local inference |
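Whichever provider produces the vector, similarity matching usually comes down to comparing two vectors. The following is a minimal sketch of cosine similarity, assuming both vectors come from the same model and therefore share a dimension. `cosineSimilarity` is a hypothetical helper for illustration, not a scrapex export.

```typescript
// Hypothetical helper (not part of scrapex): cosine similarity between two
// embedding vectors of equal length. Returns a value in [-1, 1].
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Vectors from different models or providers are generally not comparable, so a semantic index would typically store the model name alongside each vector.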
Embedding Features
- Optional PII redaction (email, phone, SSN, credit card, IP)
- Smart chunking with configurable size/overlap
- Aggregation modes: average, max, first, all
- Content-addressable caching
- Resilience: retry, circuit breaker, rate limiting
- SSRF protection for custom endpoints
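To make the chunking and aggregation items above concrete, here is a minimal sketch that assumes character-based chunks and element-wise aggregation. `chunkText` and `aggregate` are hypothetical helpers written for illustration; scrapex's own chunker, its units, and its option names may differ.

```typescript
// Hypothetical helpers (not scrapex exports) illustrating chunking with
// overlap and the 'average' / 'max' / 'first' aggregation modes.
type Aggregation = 'average' | 'max' | 'first';

// Split text into fixed-size character windows. Consecutive windows share
// `overlap` characters so context is not lost at chunk boundaries.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error('Require size > 0 and 0 <= overlap < size');
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window reached the end
  }
  return chunks;
}

// Combine one vector per chunk into a single document-level vector.
function aggregate(vectors: number[][], mode: Aggregation): number[] {
  const first = vectors[0];
  if (first === undefined) throw new Error('No vectors to aggregate');
  if (mode === 'first') return [...first];
  return first.map((_, i) => {
    const column = vectors.map((v) => v[i] ?? 0);
    return mode === 'max'
      ? Math.max(...column)
      : column.reduce((sum, x) => sum + x, 0) / column.length;
  });
}
```

The all mode is left out of the sketch; presumably it keeps every chunk vector instead of combining them, which suits retrieval over individual passages.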
Standalone Embedding Functions
import { embed, embedScrapedData } from 'scrapex/embeddings';
// Embed arbitrary text
const result = await embed('Your text here', { provider, model });
// Embed previously scraped data
const embedResult = await embedScrapedData(scrapedData, { provider });
RSS/Atom Feed Parsing
Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:
import { RSSParser, fetchFeed, paginateFeed, normalizeFeedItem } from 'scrapex';
// Parse feed XML
const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');
console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title); // Feed title
console.log(result.data.items); // Array of feed items
// Fetch and parse in one call
const feed = await fetchFeed('https://example.com/feed.xml');
// Paginate through feeds with rel="next" links (RFC 5005)
for await (const page of paginateFeed('https://example.com/atom')) {
console.log(`Page with ${page.items.length} items`);
}
// Normalize feed items for embeddings
for (const item of feed.data.items) {
const normalized = await normalizeFeedItem(item, { mode: 'full' });
console.log(normalized.text); // Clean text from content/description
}
Feed Utilities
import {
discoverFeeds, // Find feed URLs in HTML
filterByDate, // Filter items by date range
feedToMarkdown, // Convert feed to markdown (LLM-ready)
feedToText, // Convert feed to plain text
} from 'scrapex';
Media RSS & Custom Fields
const parser = new RSSParser({
customFields: {
// iTunes podcast fields
duration: 'itunes\\:duration',
explicit: 'itunes\\:explicit',
// Media RSS with attribute extraction
thumbnail: 'media\\:thumbnail@url',
contentUrl: 'media\\:content@url',
},
});
const result = parser.parse(podcastFeed);
console.log(result.data.items[0]?.customFields?.thumbnail);
// => "https://example.com/images/thumbnail.jpg"
LLM Integration
Provider-agnostic LLM support for content enhancement:
import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';
const llm = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });
const result = await scrape(url, {
llm,
enhance: ['summarize', 'tags', 'entities', 'classify'],
});
console.log(result.summary); // AI-generated summary
console.log(result.tags); // Extracted keywords
console.log(result.entities); // Named entities
console.log(result.category); // Content classification
Structured Extraction
const result = await scrape(url, {
llm,
extract: {
productName: 'string',
price: 'number',
features: 'string[]',
inStock: 'boolean',
},
});
console.log(result.extracted); // { productName, price, features, inStock }
Built-in Extractors
| Extractor | Priority | Purpose |
|---|---|---|
| MetaExtractor | 100 | Open Graph, Twitter Cards, meta |
| JsonLdExtractor | 80 | JSON-LD structured data |
| FaviconExtractor | 70 | Favicon discovery |
| ContentExtractor | 50 | Readability + Turndown → Markdown |
| LinksExtractor | 30 | Link extraction |
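The priorities in the table imply an ordering among extractors. The sketch below assumes that higher-priority extractors run first and that a later extractor does not overwrite a field an earlier one already set. Both points are assumptions about the pipeline, and the `Extractor` interface and `runPipeline` function are hypothetical, not scrapex's actual types.

```typescript
// Hypothetical types (not scrapex's real interfaces) sketching a
// priority-ordered extractor pipeline.
interface Extractor {
  name: string;
  priority: number;
  extract(html: string): Record<string, unknown>;
}

// Run extractors from highest to lowest priority. Assumption: the first
// extractor to produce a defined value for a field wins.
function runPipeline(html: string, extractors: Extractor[]): Record<string, unknown> {
  const ordered = [...extractors].sort((a, b) => b.priority - a.priority);
  const result: Record<string, unknown> = {};
  for (const extractor of ordered) {
    for (const [key, value] of Object.entries(extractor.extract(html))) {
      if (!(key in result) && value !== undefined) result[key] = value;
    }
  }
  return result;
}

// Two toy extractors: one reads <title>, the other supplies fallbacks.
const titleFromMarkup: Extractor = {
  name: 'title-from-markup',
  priority: 100,
  extract: (html) => ({ title: /<title>(.*?)<\/title>/i.exec(html)?.[1] }),
};
const fallbackTitle: Extractor = {
  name: 'fallback-title',
  priority: 60,
  extract: () => ({ title: 'Untitled', source: 'fallback' }),
};
```

Under these assumptions, `runPipeline('<title>Hi</title>', [fallbackTitle, titleFromMarkup])` keeps the title found in the markup and only takes `source` from the fallback, regardless of the order in which the extractors were registered.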
URL Utilities
import {
normalizeUrl,
resolveUrl,
extractDomain,
isExternalUrl,
isValidUrl,
matchesUrlPattern,
getProtocol,
getPath,
} from 'scrapex';
Types
Content Types
type BlockType =
| 'paragraph' | 'heading' | 'list' | 'quote'
| 'table' | 'code' | 'media' | 'nav'
| 'footer' | 'promo' | 'legal' | 'unknown';
interface ContentBlock {
type: BlockType;
text: string;
level?: 1 | 2 | 3 | 4 | 5 | 6;
html?: string;
attrs?: Record<string, string>;
}
interface NormalizationMeta {
charCount: number;
tokenEstimate: number;
language: string;
hash: string;
blocksTotal: number;
blocksAccepted: number;
truncated: boolean;
}
Feed Types
interface FeedItem {
id: string;
title: string;
link: string;
description?: string;
content?: string;
author?: string;
publishedAt?: string; // ISO 8601
updatedAt?: string;
categories: string[];
enclosure?: FeedEnclosure;
customFields?: Record<string, string>;
}
interface ParsedFeed {
format: 'rss2' | 'rss1' | 'atom';
title: string;
description?: string;
link: string;
next?: string; // RFC 5005 pagination
items: FeedItem[];
}
Security
- HTTPS-only URL resolution in feeds
- XML mode parsing (prevents XSS vectors)
- ReDoS prevention via input limiting
- maxBlocks enforcement (default 2000)
- PII redaction before embedding API calls
- SSRF protection for custom endpoints
- SHA-256 content hashing for deduplication
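As an illustration of the last item, content hashing for deduplication can be sketched with Node's built-in `node:crypto` module. `contentHash` and `dedupe` are hypothetical helpers, and collapsing whitespace before hashing is an assumption made here; the release notes do not say how scrapex canonicalizes text before hashing.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper (not a scrapex export): hash whitespace-normalized
// text so trivially different copies of the same content share one key.
function contentHash(text: string): string {
  const canonical = text.replace(/\s+/g, ' ').trim();
  return createHash('sha256').update(canonical, 'utf8').digest('hex');
}

// Keep only the first document seen for each distinct hash.
function dedupe(docs: string[]): string[] {
  const seen = new Set<string>();
  return docs.filter((doc) => {
    const hash = contentHash(doc);
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}
```

The same hash can double as a cache key, so unchanged content does not have to be embedded again.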
Requirements
- Node.js 20+
Migration from Beta
No breaking changes from v1.0.0-beta.5. Update your installation:
npm install scrapex@latest
Acknowledgments
Built with:
- Mozilla Readability for content extraction
- Cheerio for HTML parsing
- Turndown for Markdown conversion
Full Changelog: v1.0.0-beta.5...v1.0.0