v1.0.0-beta.4
Pre-release
scrapex v1.0.0-beta.4
First-class content normalization for embedding-ready and LLM-optimized text output.
Highlights
- New `normalize` option in `scrape()` produces clean, boilerplate-free text directly in `ScrapedData`.
- Block-based content classification with customizable classifiers for filtering navigation, footers, and promotional content.
- Embeddings pipeline now prefers `normalizedText` when available for higher-quality vector representations.
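The preference described in the last highlight can be pictured as a simple fallback. This is a minimal sketch and not scrapex's internal code: the `ScrapedLike` shape and the `pickEmbeddingInput` helper are hypothetical, and only the `normalizedText` and `textContent` field names come from these notes.

```typescript
// Hypothetical subset of a scrape result; only the two field names
// are taken from the release notes.
interface ScrapedLike {
  textContent: string;
  normalizedText?: string;
}

// Prefer normalized text when it is present and non-empty,
// otherwise fall back to the existing textContent field.
function pickEmbeddingInput(data: ScrapedLike): string {
  const normalized = data.normalizedText?.trim();
  return normalized ? normalized : data.textContent;
}
```

Because `normalizedText` is optional, callers that never enable `normalize` would keep receiving `textContent` under this scheme.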
New Features
Content Normalization (normalize option)

    import { scrape } from 'scrapex';

    const result = await scrape(url, {
      normalize: {
        mode: 'full', // or 'summary' for score-ranked blocks
        removeBoilerplate: true, // filter nav, footer, promos
        maxChars: 5000, // truncate at sentence boundary
      },
    });

    console.log(result.normalizedText); // Clean text ready for embedding
    console.log(result.normalizationMeta); // Char count, token estimate, hash

Standalone Normalization
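
    import { load } from 'cheerio'; // assumed source of load(); not shown in the original snippet
    import { parseBlocks, normalizeText, combineClassifiers, defaultBlockClassifier } from 'scrapex';

    // html is the raw page markup as a string
    const $ = load(html);
    const blocks = parseBlocks($);
    const result = await normalizeText(blocks, { mode: 'full' });

Custom Classifiers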

    const myClassifier = combineClassifiers(
      defaultBlockClassifier,
      (block) => {
        if (block.text.includes('About the Author')) {
          return { accept: false, label: 'author-bio' };
        }
        return { accept: true };
      }
    );

    const result = await scrape(url, {
      normalize: { blockClassifier: myClassifier },
    });

New Types
- `ContentBlock`: classified content block with type, text, and context
- `BlockType`: paragraph, heading, list, quote, table, code, media, nav, footer, promo, legal
- `ClassifierResult`: accept/reject with optional score and label
- `ContentBlockClassifier`: sync or async classifier function
- `NormalizeOptions`: full configuration for normalization
- `NormalizationMeta`: charCount, tokenEstimate, hash, blocksTotal/Accepted, truncated
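To make the classifier types more concrete, here is a rough sketch of how accept/reject results could be chained. Everything in it is an assumption for illustration: the `Block`, `SyncClassifier`, and `combineSketch` names are hypothetical stand-ins, the "first rejection wins" rule is one plausible policy rather than the documented behavior of `combineClassifiers`, and only the synchronous case is handled even though the notes describe classifiers as sync or async.

```typescript
// Local stand-ins for the listed types; field names follow the release
// notes, but the exact shapes are assumptions.
type ClassifierResult = { accept: boolean; score?: number; label?: string };
type Block = { type: string; text: string };
type SyncClassifier = (block: Block) => ClassifierResult;

// Hypothetical combinator: run classifiers in order and stop at the
// first rejection. The library's combineClassifiers may behave differently.
function combineSketch(...classifiers: SyncClassifier[]): SyncClassifier {
  return (block) => {
    for (const classify of classifiers) {
      const result = classify(block);
      if (!result.accept) return result;
    }
    return { accept: true };
  };
}

// Rejects blocks already typed as navigation.
const rejectNav: SyncClassifier = (block) =>
  block.type === 'nav' ? { accept: false, label: 'nav' } : { accept: true };

// Rejects author bios, mirroring the custom classifier shown earlier.
const rejectAuthorBio: SyncClassifier = (block) =>
  block.text.includes('About the Author')
    ? { accept: false, label: 'author-bio' }
    : { accept: true };

const classifier = combineSketch(rejectNav, rejectAuthorBio);

const blocks: Block[] = [
  { type: 'nav', text: 'Home | Docs | Blog' },
  { type: 'paragraph', text: 'About the Author: Jane writes about scraping.' },
  { type: 'paragraph', text: 'Normalization removes boilerplate.' },
];

// Keep only the blocks that every classifier accepts.
const kept = blocks.filter((block) => classifier(block).accept);
console.log(kept.map((block) => block.text));
```

In this sample only the last block survives, since the first is rejected as `nav` and the second as `author-bio`.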
Security
- ReDoS prevention via input limiting (1000-character slice for regex matching)
- `maxBlocks` enforcement (default 2000) prevents unbounded DOM traversal
- HTML content only included when `includeHtml: true` is explicitly set
- SHA-256 content hashing, truncated to 128 bits (32 hex characters), for deduplication
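The safeguards above can be sketched with Node's built-in `crypto` module. This is an illustration under stated assumptions, not the library's implementation: the helper names (`looksPromotional`, `contentHash`, `limitBlocks`) and the regex pattern are hypothetical, while the 1000-character slice, the 2000-block default, and the 32-hex-character hash length come from the list above.

```typescript
import { createHash } from 'node:crypto';

// Assumed limit, matching the 1000-character slice mentioned above.
const MAX_REGEX_INPUT = 1000;

// Only a bounded prefix is handed to the regex, so matching work
// does not grow with the length of the full input.
function looksPromotional(text: string): boolean {
  const sample = text.slice(0, MAX_REGEX_INPUT);
  return /\b(subscribe|sign up|limited offer)\b/i.test(sample);
}

// SHA-256 digest truncated to its first 32 hex characters (128 bits).
function contentHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex').slice(0, 32);
}

// Cap the number of blocks processed, mirroring the maxBlocks default of 2000.
function limitBlocks<T>(blocks: T[], maxBlocks = 2000): T[] {
  return blocks.slice(0, maxBlocks);
}
```

Two inputs with identical text produce the same `contentHash`, which is what makes the truncated digest usable as a deduplication key.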
Documentation
- New normalize-text design document with full API reference
- Updated scrape, embeddings, and types documentation
- New example: `examples/22-normalize-text.ts`
Installation

    npm install scrapex@beta

Notes
- Requires Node.js 20+.
- `normalizedText` is optional and does not affect existing `content`/`textContent` fields.
Full Changelog: v1.0.0-beta.3...v1.0.0-beta.4