scrapex v1.0.0-alpha.1

First alpha release of scrapex — a modern web scraper with LLM-enhanced extraction, an extensible pipeline, and pluggable parsers.

Highlights

LLM-optimized output (content as Markdown plus textContent, to reduce token usage)
Provider-agnostic LLM enhancements (OpenAI/Anthropic + OpenAI-compatible servers like Ollama/LM Studio)
Priority-based, pluggable extraction pipeline (easy to extend)
Dual build output (ESM + CJS) with TypeScript types

Installation

npm install scrapex@alpha

Optional peer dependencies

# For LLM features
npm install openai
npm install @anthropic-ai/sdk

# For JavaScript-rendered pages (via a custom Puppeteer/Playwright fetcher)
npm install puppeteer

What’s Included

Core scraping API

scrape(url, options?) — fetch + extract metadata and content
scrapeHtml(html, url, options?) — extract from raw HTML without fetching
Configurable timeout, userAgent, maxContentLength, and extractContent
Robots.txt support via respectRobots and checkRobotsTxt()

Built-in extractors (priority-based)

| Extractor | Priority | Purpose |
| --- | --- | --- |
| `MetaExtractor` | 100 | Open Graph, Twitter Cards, meta tags |
| `JsonLdExtractor` | 80 | JSON-LD structured data |
| `FaviconExtractor` | 70 | Favicon discovery |
| `ContentExtractor` | 50 | Mozilla Readability + Turndown → Markdown |
| `LinksExtractor` | 30 | Link extraction |
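
To illustrate what a priority-based pipeline can look like, here is a minimal, self-contained sketch. It is not scrapex's implementation: the `Extractor` interface, `runPipeline`, and the two sample extractors are hypothetical, and it assumes that higher-priority extractors run first and that their values win when two extractors set the same key.

```typescript
// Hypothetical shapes for illustration; scrapex's real interfaces may differ.
interface Extractor {
  name: string;
  priority: number;
  extract(html: string): Record<string, unknown>;
}

// Run extractors from highest to lowest priority. Assumption: when two
// extractors set the same key, the higher-priority value is kept.
function runPipeline(extractors: Extractor[], html: string): Record<string, unknown> {
  const ordered = [...extractors].sort((a, b) => b.priority - a.priority);
  const result: Record<string, unknown> = {};
  for (const extractor of ordered) {
    for (const [key, value] of Object.entries(extractor.extract(html))) {
      if (!(key in result)) result[key] = value;
    }
  }
  return result;
}

// Sample extractor: reads the <title> element if one is present.
const meta: Extractor = {
  name: 'meta',
  priority: 100,
  extract: (html) => {
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    return match ? { title: match[1] } : {};
  },
};

// Sample extractor: supplies defaults that only apply when nothing else did.
const fallback: Extractor = {
  name: 'fallback',
  priority: 10,
  extract: () => ({ title: 'Untitled', source: 'fallback' }),
};

const merged = runPipeline([fallback, meta], '<title>Hello</title>');
```

In this sketch the registration order does not matter; only the `priority` values decide which extractor's `title` ends up in `merged`.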

LLM integration (optional)

Provider-agnostic LLM support via scrapex/llm:

OpenAI: createOpenAI() / OpenAIProvider
Anthropic: AnthropicProvider
Local / OpenAI-compatible: createOllama(), createLMStudio()

Enhancements:

summarize — summary generation
tags — keywords/tags
entities — named entities
classify — content type classification

Structured extraction (schema-driven):

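import { scrape } from 'scrapex';

// `llm` is a provider created via scrapex/llm (for example, with createOpenAI()),
// and `url` is the page to scrape. Each key maps a field name to its expected type.
const result = await scrape(url, {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
  },
});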
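
LLM output does not always match the requested types, so it can be worth checking the returned fields against the same schema. The following is a minimal sketch under that assumption; `FieldType`, `matchesType`, and `findInvalidFields` are hypothetical helpers written for this example, not part of scrapex, and the JSON string stands in for a model response.

```typescript
// The four field types used in the schema example above.
type FieldType = 'string' | 'number' | 'boolean' | 'string[]';
type Schema = Record<string, FieldType>;

// Check a single value against a declared field type.
function matchesType(value: unknown, type: FieldType): boolean {
  if (type === 'string[]') {
    return Array.isArray(value) && value.every((item) => typeof item === 'string');
  }
  return typeof value === type;
}

// Hypothetical helper: list the schema fields that are missing or mistyped.
function findInvalidFields(data: Record<string, unknown>, schema: Schema): string[] {
  return Object.entries(schema)
    .filter(([key, type]) => !matchesType(data[key], type))
    .map(([key]) => key);
}

const schema: Schema = {
  productName: 'string',
  price: 'number',
  features: 'string[]',
  inStock: 'boolean',
};

// Stand-in for a model response in which `price` came back as a string.
const llmOutput = JSON.parse(
  '{"productName":"Widget","price":"19.99","features":["small"],"inStock":true}',
) as Record<string, unknown>;

const invalid = findInvalidFields(llmOutput, schema);
```

Here `invalid` contains only `price`, because the stand-in response returned it as a string rather than a number.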

Parsers and utilities

scrapex/parsers: MarkdownParser, extractListLinks(), GitHub URL helpers (isGitHubRepo(), parseGitHubUrl(), toRawUrl(), etc.)
URL utilities: normalizeUrl(), resolveUrl(), extractDomain(), isExternalUrl(), etc.
Typed error handling via ScrapeError (with retryable classification)
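
A retryable classification is typically used to decide whether a failed request is worth repeating. The sketch below shows one way to do that with exponential backoff. It is an illustration only: `RetryableError` is a stand-in class, not scrapex's `ScrapeError`, and the `retryable` property name, `nextDelayMs`, and `withRetry` are all assumptions made for this example.

```typescript
// Stand-in for a typed scraping error; hypothetical, not scrapex's ScrapeError.
class RetryableError extends Error {
  constructor(message: string, readonly retryable: boolean) {
    super(message);
    this.name = 'RetryableError';
  }
}

// Duck-typed check: any error object with `retryable === true` qualifies.
function isRetryable(error: unknown): boolean {
  return (
    typeof error === 'object' &&
    error !== null &&
    (error as { retryable?: unknown }).retryable === true
  );
}

// Decide how long to wait before the next attempt, or null to give up.
// `attempt` is 1-based; the delay doubles after each failed attempt.
function nextDelayMs(
  error: unknown,
  attempt: number,
  maxAttempts = 3,
  baseMs = 500,
): number | null {
  if (!isRetryable(error) || attempt >= maxAttempts) return null;
  return baseMs * 2 ** (attempt - 1);
}

// Run an async task, retrying only while the error is classified as retryable.
async function withRetry<T>(task: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const delay = nextDelayMs(error, attempt, maxAttempts);
      if (delay === null) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A call such as `withRetry(() => scrape(url))` would then repeat only failures flagged as retryable, provided the library's error exposes a flag that `isRetryable` can read.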

Notes

This is an alpha release; APIs may change before v1.0.0.
Requires Node.js 20+.

Full Changelog: https://github.com/developer-rakeshpaul/scrapex/commits/v1.0.0-alpha.1
