scrapex v1.0.0-alpha.1

Pre-release

developer-rakeshpaul released this 14 Dec 13:30 · 41 commits to main since this release

First alpha release of scrapex — a modern web scraper with LLM-enhanced extraction, an extensible pipeline, and pluggable parsers.

Highlights

  • LLM-optimized output (content as Markdown, plus plain-text textContent for lower token usage)
  • Provider-agnostic LLM enhancements (OpenAI/Anthropic + OpenAI-compatible servers like Ollama/LM Studio)
  • Priority-based, pluggable extraction pipeline (easy to extend)
  • Dual build output (ESM + CJS) with TypeScript types

Installation

npm install scrapex@alpha

Optional peer dependencies

# For LLM features
npm install openai
npm install @anthropic-ai/sdk

# For JavaScript-rendered pages (via a custom Puppeteer/Playwright fetcher)
npm install puppeteer

What’s Included

Core scraping API

  • scrape(url, options?) — fetch + extract metadata and content
  • scrapeHtml(html, url, options?) — extract from raw HTML without fetching
  • Configurable timeout, userAgent, maxContentLength, and extractContent
  • Robots.txt support via respectRobots and checkRobotsTxt()
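To make the robots.txt bullet concrete, here is a minimal sketch of the kind of check such a feature performs. It is not scrapex's checkRobotsTxt() implementation: parseRobots and isAllowed are hypothetical names, the matching is plain prefix matching with longest-rule-wins, and wildcard patterns (* and $) are deliberately left out.

```typescript
interface RobotsRule {
  allow: boolean;
  path: string;
}

interface RobotsGroup {
  agents: string[];
  rules: RobotsRule[];
}

// Parse robots.txt into groups of user agents and their Allow/Disallow rules.
function parseRobots(text: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // A rule with an empty path matches nothing, so it is skipped.
      if (value !== '') current.rules.push({ allow: field === 'allow', path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Decide whether `path` may be fetched by `userAgent` (simplified matching).
function isAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  const groups = parseRobots(robotsTxt);
  const ua = userAgent.toLowerCase();
  // Prefer a group that names this agent; fall back to the wildcard group.
  const group =
    groups.find((g) => g.agents.some((a) => a !== '*' && a.length > 0 && ua.includes(a))) ??
    groups.find((g) => g.agents.includes('*'));
  if (!group) return true;
  let best: RobotsRule | undefined;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieGoesToAllow =
      best !== undefined && rule.path.length === best.path.length && rule.allow;
    if (longer || tieGoesToAllow) best = rule;
  }
  return best ? best.allow : true;
}

const robots = [
  'User-agent: *',
  'Disallow: /private/',
  'Allow: /private/press/',
].join('\n');

console.log(isAllowed(robots, 'example-bot/1.0', '/private/data'));      // false
console.log(isAllowed(robots, 'example-bot/1.0', '/private/press/kit')); // true
console.log(isAllowed(robots, 'example-bot/1.0', '/blog'));              // true
```

The longest-match rule means a narrow Allow can carve an exception out of a broader Disallow, which is why /private/press/kit passes while /private/data does not.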

Built-in extractors (priority-based)

| Extractor        | Priority | Purpose                                   |
|------------------|----------|-------------------------------------------|
| MetaExtractor    | 100      | Open Graph, Twitter Cards, meta tags      |
| JsonLdExtractor  | 80       | JSON-LD structured data                   |
| FaviconExtractor | 70       | Favicon discovery                         |
| ContentExtractor | 50       | Mozilla Readability + Turndown → Markdown |
| LinksExtractor   | 30       | Link extraction                           |
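One way to read the priorities above is sketched below. The release notes do not say how scrapex merges extractor output, so this assumes that higher-priority extractors run first and that a field they set is not overwritten by a lower-priority one. The Extractor interface and runPipeline function are hypothetical and are not scrapex's types.

```typescript
// Hypothetical types for illustration only.
type Fields = Record<string, unknown>;

interface Extractor {
  name: string;
  priority: number;
  extract(html: string): Fields;
}

// Run extractors from highest to lowest priority. A field set by an
// earlier (higher-priority) extractor is kept; later values are ignored.
function runPipeline(extractors: Extractor[], html: string): Fields {
  const ordered = [...extractors].sort((a, b) => b.priority - a.priority);
  const result: Fields = {};
  for (const extractor of ordered) {
    for (const [key, value] of Object.entries(extractor.extract(html))) {
      if (value !== undefined && !(key in result)) {
        result[key] = value;
      }
    }
  }
  return result;
}

// Two toy extractors that disagree on `title`.
const metaLike: Extractor = {
  name: 'meta',
  priority: 100,
  extract: () => ({ title: 'From meta tags' }),
};
const contentLike: Extractor = {
  name: 'content',
  priority: 50,
  extract: () => ({ title: 'From article body', wordCount: 1200 }),
};

// Registration order does not matter; priority decides who wins.
const merged = runPipeline([contentLike, metaLike], '<html></html>');
console.log(merged); // { title: 'From meta tags', wordCount: 1200 }
```

Under this reading, a custom extractor would slot in by choosing a priority number relative to the built-in ones.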

LLM integration (optional)

Provider-agnostic LLM support via scrapex/llm:

  • OpenAI: createOpenAI() / OpenAIProvider
  • Anthropic: AnthropicProvider
  • Local / OpenAI-compatible: createOllama(), createLMStudio()

Enhancements:

  • summarize — summary generation
  • tags — keywords/tags
  • entities — named entities
  • classify — content type classification

Structured extraction (schema-driven):

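import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';

// Any provider from scrapex/llm can be used; the no-argument call here is an assumption.
const llm = createOpenAI();
const url = 'https://example.com/product'; // placeholder URL

// Each key names a field to extract; each value is the type expected for it.
const result = await scrape(url, {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
  },
});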
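The schema above uses four type strings. As a rough illustration of what such a schema enables, the sketch below checks a JSON reply against it and derives a matching TypeScript type. None of these helpers (FieldType, findSchemaErrors, parseExtracted) come from scrapex; they are hypothetical, and the library's own validation may work differently.

```typescript
// Hypothetical schema vocabulary: the four type strings used in the example.
type FieldType = 'string' | 'number' | 'boolean' | 'string[]';
type Schema = Record<string, FieldType>;

// Map each schema type string to the TypeScript type it describes.
type TypeOf<T extends FieldType> =
  T extends 'string' ? string :
  T extends 'number' ? number :
  T extends 'boolean' ? boolean :
  string[];

type Extracted<S extends Schema> = { [K in keyof S]: TypeOf<S[K]> };

function matchesType(value: unknown, type: FieldType): boolean {
  if (type === 'string[]') {
    return Array.isArray(value) && value.every((v) => typeof v === 'string');
  }
  return typeof value === type;
}

// List the fields that are missing or have the wrong type.
function findSchemaErrors(schema: Schema, data: Record<string, unknown>): string[] {
  return Object.entries(schema)
    .filter(([key, type]) => !matchesType(data[key], type))
    .map(([key, type]) => `${key}: expected ${type}`);
}

// Parse a JSON string and fail loudly if it does not fit the schema.
function parseExtracted<S extends Schema>(schema: S, json: string): Extracted<S> {
  const data = JSON.parse(json) as Record<string, unknown>;
  const errors = findSchemaErrors(schema, data);
  if (errors.length > 0) {
    throw new Error(`Schema mismatch: ${errors.join('; ')}`);
  }
  return data as unknown as Extracted<S>;
}

const productSchema = {
  productName: 'string',
  price: 'number',
  features: 'string[]',
  inStock: 'boolean',
} as const;

// Made-up reply text, standing in for what a model might return.
const reply = '{"productName":"Kettle","price":39.5,"features":["1.7 L"],"inStock":true}';
const product = parseExtracted(productSchema, reply);
console.log(product.price.toFixed(2)); // "39.50" (price is typed as number)
```

Validating before use matters because model output is not guaranteed to follow the requested shape; a string such as "39.50" in place of a number would be reported here as `price: expected number`.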

Parsers and utilities

  • scrapex/parsers: MarkdownParser, extractListLinks(), GitHub URL helpers (isGitHubRepo(), parseGitHubUrl(), toRawUrl(), etc.)
  • URL utilities: normalizeUrl(), resolveUrl(), extractDomain(), isExternalUrl(), etc.
  • Typed error handling via ScrapeError (with retryable classification)
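A retryable classification is typically consumed by a retry loop. The sketch below shows one such loop with exponential backoff. The fields of ScrapeError are not described in these notes, so the code only assumes an error object carrying a boolean retryable property; isRetryable, backoffMs, and withRetry are hypothetical helpers, not part of scrapex.

```typescript
// Treat any error object with `retryable === true` as worth retrying.
// The exact shape of ScrapeError is an assumption here.
function isRetryable(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { retryable?: unknown }).retryable === true
  );
}

// Exponential backoff: 500 ms, 1000 ms, 2000 ms, ... capped at maxMs.
function backoffMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `task`, retrying only on retryable errors, up to maxAttempts in total.
async function withRetry<T>(task: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      const isLastAttempt = attempt >= maxAttempts - 1;
      if (isLastAttempt || !isRetryable(err)) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffMs(attempt)));
    }
  }
}

// Toy errors standing in for a timeout (retryable) and a 404 (not retryable).
const timeoutLike = Object.assign(new Error('timed out'), { retryable: true });
const notFoundLike = Object.assign(new Error('not found'), { retryable: false });

console.log(isRetryable(timeoutLike));  // true
console.log(isRetryable(notFoundLike)); // false
console.log(backoffMs(2));              // 2000
```

In use, a call such as `withRetry(() => scrape(url))` would retry transient failures and surface permanent ones immediately, provided the thrown errors expose the flag in the way assumed above.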

Notes

  • This is an alpha release; APIs may change before v1.0.0.
  • Requires Node.js 20+.

Full Changelog: https://github.com/developer-rakeshpaul/scrapex/commits/v1.0.0-alpha.1