Skip to content

v1.0.0

Latest

Choose a tag to compare

@developer-rakeshpaul developer-rakeshpaul released this 08 Jan 05:26
· 4 commits to main since this release

scrapex v1.0.0

The first stable release of scrapex — a modern web scraper with LLM-enhanced extraction, vector embeddings, content normalization, and comprehensive feed parsing.

Installation

npm install scrapex

Optional Peer Dependencies

# For LLM features
npm install openai              # OpenAI/Ollama/LM Studio
npm install @anthropic-ai/sdk   # Anthropic Claude

# For local embeddings (zero API cost)
npm install @huggingface/transformers onnxruntime-node

# For JavaScript-rendered pages
npm install puppeteer

Highlights

  • LLM-Ready Output — Content extracted as Markdown, optimized for AI consumption
  • Vector Embeddings — Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
  • Content Normalization — Clean, boilerplate-free text for embeddings and RAG pipelines
  • RSS/Atom Feeds — Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with Media RSS support
  • Provider-Agnostic LLM — Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
  • Extensible Pipeline — Priority-based pluggable extractors
  • TypeScript First — Full type safety with comprehensive exports
  • Dual Format — ESM and CommonJS builds

Core Features

Web Scraping

import { scrape, scrapeHtml } from 'scrapex';

// Fetch and extract
const result = await scrape('https://example.com/article', {
  timeout: 10000,
  extractContent: true,
  respectRobots: true,
});

console.log(result.title);       // Page title
console.log(result.content);     // Markdown content
console.log(result.textContent); // Plain text
console.log(result.excerpt);     // ~300 char preview

// Extract from raw HTML
const html = await fetchSomehow('https://example.com');
const result = await scrapeHtml(html, 'https://example.com');

Content Normalization

Clean, embedding-ready text with boilerplate removal and block classification:

const result = await scrape(url, {
  normalize: {
    mode: 'full',            // or 'summary' for score-ranked blocks
    removeBoilerplate: true, // filter nav, footer, promos
    maxChars: 5000,          // truncate at sentence boundary
  },
});

console.log(result.normalizedText);    // Clean text ready for embedding
console.log(result.normalizationMeta); // charCount, tokenEstimate, hash

Standalone Normalization

import { parseBlocks, normalizeText, defaultBlockClassifier, combineClassifiers } from 'scrapex';
import { load } from 'cheerio';

const $ = load(html);
const blocks = parseBlocks($);
const result = await normalizeText(blocks, { mode: 'full' });

Custom Classifiers

const myClassifier = combineClassifiers(
  defaultBlockClassifier,
  (block) => {
    if (block.text.includes('About the Author')) {
      return { accept: false, label: 'author-bio' };
    }
    return { accept: true };
  }
);

const result = await scrape(url, {
  normalize: { blockClassifier: myClassifier },
});

Vector Embeddings

Generate embeddings from scraped content for semantic search, RAG, and similarity matching:

import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';

const result = await scrape('https://example.com/article', {
  embeddings: {
    provider: { type: 'custom', provider: createOpenAIEmbedding() },
    model: 'text-embedding-3-small',
  },
});

if (result.embeddings?.status === 'success') {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}

Embedding Providers

Provider Factory Function Notes
OpenAI createOpenAIEmbedding() text-embedding-3-small/large
Azure OpenAI createAzureEmbedding() Enterprise deployments
Cohere createCohereEmbedding() embed-english-v3.0
HuggingFace createHuggingFaceEmbedding() Inference API
Ollama createOllamaEmbedding() Local models
Transformers.js createTransformersEmbedding() Zero-cost local inference

Embedding Features

  • Optional PII redaction (email, phone, SSN, credit card, IP)
  • Smart chunking with configurable size/overlap
  • Aggregation modes: average, max, first, all
  • Content-addressable caching
  • Resilience: retry, circuit breaker, rate limiting
  • SSRF protection for custom endpoints

Standalone Embedding Functions

import { embed, embedScrapedData } from 'scrapex/embeddings';

// Embed arbitrary text
const result = await embed('Your text here', { provider, model });

// Embed previously scraped data
const embedResult = await embedScrapedData(scrapedData, { provider });

RSS/Atom Feed Parsing

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:

import { RSSParser, fetchFeed, paginateFeed, normalizeFeedItem } from 'scrapex';

// Parse feed XML
const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');

console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title);  // Feed title
console.log(result.data.items);  // Array of feed items

// Fetch and parse in one call
const feed = await fetchFeed('https://example.com/feed.xml');

// Paginate through feeds with rel="next" links (RFC 5005)
for await (const page of paginateFeed('https://example.com/atom')) {
  console.log(`Page with ${page.items.length} items`);
}

// Normalize feed items for embeddings
for (const item of feed.data.items) {
  const normalized = await normalizeFeedItem(item, { mode: 'full' });
  console.log(normalized.text); // Clean text from content/description
}

Feed Utilities

import {
  discoverFeeds,   // Find feed URLs in HTML
  filterByDate,    // Filter items by date range
  feedToMarkdown,  // Convert feed to markdown (LLM-ready)
  feedToText,      // Convert feed to plain text
} from 'scrapex';

Media RSS & Custom Fields

const parser = new RSSParser({
  customFields: {
    // iTunes podcast fields
    duration: 'itunes\\:duration',
    explicit: 'itunes\\:explicit',
    // Media RSS with attribute extraction
    thumbnail: 'media\\:thumbnail@url',
    contentUrl: 'media\\:content@url',
  },
});

const result = parser.parse(podcastFeed);
console.log(result.data.items[0]?.customFields?.thumbnail);
// => "https://example.com/images/thumbnail.jpg"

LLM Integration

Provider-agnostic LLM support for content enhancement:

import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';

const llm = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });

const result = await scrape(url, {
  llm,
  enhance: ['summarize', 'tags', 'entities', 'classify'],
});

console.log(result.summary);    // AI-generated summary
console.log(result.tags);       // Extracted keywords
console.log(result.entities);   // Named entities
console.log(result.category);   // Content classification

Structured Extraction

const result = await scrape(url, {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
  },
});

console.log(result.extracted); // { productName, price, features, inStock }

Built-in Extractors

Extractor Priority Purpose
MetaExtractor 100 Open Graph, Twitter Cards, meta
JsonLdExtractor 80 JSON-LD structured data
FaviconExtractor 70 Favicon discovery
ContentExtractor 50 Readability + Turndown → Markdown
LinksExtractor 30 Link extraction

URL Utilities

import {
  normalizeUrl,
  resolveUrl,
  extractDomain,
  isExternalUrl,
  isValidUrl,
  matchesUrlPattern,
  getProtocol,
  getPath,
} from 'scrapex';

Types

Content Types

type BlockType =
  | 'paragraph' | 'heading' | 'list' | 'quote'
  | 'table' | 'code' | 'media' | 'nav'
  | 'footer' | 'promo' | 'legal' | 'unknown';

interface ContentBlock {
  type: BlockType;
  text: string;
  level?: 1 | 2 | 3 | 4 | 5 | 6;
  html?: string;
  attrs?: Record<string, string>;
}

interface NormalizationMeta {
  charCount: number;
  tokenEstimate: number;
  language: string;
  hash: string;
  blocksTotal: number;
  blocksAccepted: number;
  truncated: boolean;
}

Feed Types

interface FeedItem {
  id: string;
  title: string;
  link: string;
  description?: string;
  content?: string;
  author?: string;
  publishedAt?: string;      // ISO 8601
  updatedAt?: string;
  categories: string[];
  enclosure?: FeedEnclosure;
  customFields?: Record<string, string>;
}

interface ParsedFeed {
  format: 'rss2' | 'rss1' | 'atom';
  title: string;
  description?: string;
  link: string;
  next?: string;             // RFC 5005 pagination
  items: FeedItem[];
}

Security

  • HTTPS-only URL resolution in feeds
  • XML mode parsing (prevents XSS vectors)
  • ReDoS prevention via input limiting
  • maxBlocks enforcement (default 2000)
  • PII redaction before embedding API calls
  • SSRF protection for custom endpoints
  • SHA-256 content hashing for deduplication

Requirements

  • Node.js 20+

Migration from Beta

No breaking changes from v1.0.0-beta.5. Update your installation:

npm install scrapex@latest

Acknowledgments

Built with:


Full Changelog: v1.0.0-beta.5...v1.0.0