scrapex v1.0.0-alpha.1
Pre-release
First alpha release of scrapex — a modern web scraper with LLM-enhanced extraction, an extensible pipeline, and pluggable parsers.
Highlights
- LLM-optimized output (`content` as Markdown + `textContent` for lower tokens)
- Provider-agnostic LLM enhancements (OpenAI/Anthropic + OpenAI-compatible servers like Ollama/LM Studio)
- Priority-based, pluggable extraction pipeline (easy to extend)
- Dual build output (ESM + CJS) with TypeScript types
Installation
```bash
npm install scrapex@alpha
```
Optional peer dependencies
```bash
# For LLM features
npm install openai
npm install @anthropic-ai/sdk
# For JavaScript-rendered pages (via a custom Puppeteer/Playwright fetcher)
npm install puppeteer
```
What’s Included
Core scraping API
- `scrape(url, options?)`: fetch + extract metadata and content
- `scrapeHtml(html, url, options?)`: extract from raw HTML without fetching
- Configurable `timeout`, `userAgent`, `maxContentLength`, and `extractContent`
- Robots.txt support via `respectRobots` and `checkRobotsTxt()`
Built-in extractors (priority-based)
| Extractor | Priority | Purpose |
|---|---|---|
| `MetaExtractor` | 100 | Open Graph, Twitter Cards, meta tags |
| `JsonLdExtractor` | 80 | JSON-LD structured data |
| `FaviconExtractor` | 70 | Favicon discovery |
| `ContentExtractor` | 50 | Mozilla Readability + Turndown → Markdown |
| `LinksExtractor` | 30 | Link extraction |
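The priorities in the table imply an ordering between extractors. As a rough illustration only, here is a self-contained sketch of how a priority-based pipeline could behave, assuming higher-priority extractors run first and the first value found for a field is kept. The `Extractor` interface, `runPipeline`, the toy extractors, and the merge rule are all hypothetical names and assumptions made for this sketch, not scrapex's actual API.

```typescript
// Hypothetical shapes for illustration; scrapex's real interfaces may differ.
type Fields = Record<string, unknown>;

interface Extractor {
  name: string;
  priority: number;
  extract(html: string): Fields;
}

// Assumption: higher priority runs first, and the first value for a field wins.
function runPipeline(extractors: Extractor[], html: string): Fields {
  const ordered = [...extractors].sort((a, b) => b.priority - a.priority);
  const result: Fields = {};
  for (const extractor of ordered) {
    for (const [key, value] of Object.entries(extractor.extract(html))) {
      if (!(key in result) && value !== undefined) {
        result[key] = value;
      }
    }
  }
  return result;
}

// Two toy extractors that both propose a title.
const metaLike: Extractor = {
  name: "meta-like",
  priority: 100,
  extract: (html) => {
    const match = /<meta property="og:title" content="([^"]*)"/.exec(html);
    return match ? { title: match[1] } : {};
  },
};

const titleTagLike: Extractor = {
  name: "title-tag-like",
  priority: 50,
  extract: (html) => {
    const match = /<title>([^<]*)<\/title>/.exec(html);
    return match ? { title: match[1], hasTitleTag: true } : {};
  },
};

const sampleHtml =
  '<head><title>Fallback</title><meta property="og:title" content="Preferred"></head>';
// Registration order does not matter; the sort decides who runs first.
const merged = runPipeline([titleTagLike, metaLike], sampleHtml);
console.log(merged);
```

In this sketch the Open Graph title is kept over the `<title>` tag because its extractor has the higher priority, while fields that only the lower-priority extractor produces are still added.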
LLM integration (optional)
Provider-agnostic LLM support via `scrapex/llm`:
- OpenAI: `createOpenAI()` / `OpenAIProvider`
- Anthropic: `AnthropicProvider`
- Local / OpenAI-compatible: `createOllama()`, `createLMStudio()`

Enhancements:
- `summarize`: summary generation
- `tags`: keywords/tags
- `entities`: named entities
- `classify`: content type classification
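To illustrate what provider-agnostic can mean in practice, the following sketch keeps an enhancement dependent only on a small interface, so any backend can be swapped in. `LLMProvider`, `complete`, `summarizeWith`, and the stub provider are hypothetical names invented for this example; they do not describe the actual `scrapex/llm` types.

```typescript
// Hypothetical interface for illustration; not the real scrapex/llm contract.
interface LLMProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// The enhancement depends only on the interface, never on a specific vendor SDK.
async function summarizeWith(provider: LLMProvider, text: string): Promise<string> {
  const prompt = `Summarize in one sentence:\n\n${text}`;
  const reply = await provider.complete(prompt);
  return reply.trim();
}

// Stand-in provider for offline testing: returns the first sentence of the input.
const firstSentenceProvider: LLMProvider = {
  name: "first-sentence-stub",
  async complete(prompt) {
    // Drop the instruction line and keep the text that follows the blank line.
    const body = prompt.split("\n\n").slice(1).join("\n\n");
    const end = body.indexOf(".");
    return end === -1 ? body : body.slice(0, end + 1);
  },
};
```

A real provider would implement the same interface by calling a hosted or local model, which is why the calling code would not need to change when switching backends.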
Structured extraction (schema-driven):
```typescript
const result = await scrape(url, {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
  },
});
```
Parsers and utilities
- `scrapex/parsers`: `MarkdownParser`, `extractListLinks()`, GitHub URL helpers (`isGitHubRepo()`, `parseGitHubUrl()`, `toRawUrl()`, etc.)
- URL utilities: `normalizeUrl()`, `resolveUrl()`, `extractDomain()`, `isExternalUrl()`, etc.
- Typed error handling via `ScrapeError` (with retryable classification)
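A retryable classification is mainly useful for deciding whether a failed request is worth re-attempting. The sketch below shows that pattern with a stand-in error class. `DemoScrapeError`, its `retryable` field, `isRetryable`, and `withRetry` are hypothetical and only mirror the idea, since the exact shape of `ScrapeError` is not described in these notes.

```typescript
// Stand-in error type; the real ScrapeError may expose this differently.
class DemoScrapeError extends Error {
  readonly retryable: boolean;

  constructor(message: string, retryable: boolean) {
    super(message);
    this.name = "DemoScrapeError";
    this.retryable = retryable;
  }
}

// Treat an error as retryable only if it explicitly says so.
function isRetryable(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { retryable?: unknown }).retryable === true
  );
}

// Re-run `task` while failures are retryable, up to `maxAttempts` total tries.
async function withRetry<T>(
  task: (attempt: number) => Promise<T>,
  maxAttempts = 3,
  delayMs = 0,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (!isRetryable(error) || attempt === maxAttempts) break;
      // Fixed pause between attempts; a real client might back off exponentially.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

Non-retryable failures are rethrown after the first attempt, so callers only pay the retry cost for errors that might succeed on a second try.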
Notes
- This is an alpha release; APIs may change before `v1.0.0`.
- Requires Node.js 20+.
Full Changelog: https://github.com/developer-rakeshpaul/scrapex/commits/v1.0.0-alpha.1