scrapex v1.0.0-beta.4

First-class content normalization for embedding-ready and LLM-optimized text output.

Highlights

New normalize option in scrape() produces clean, boilerplate-free text directly in ScrapedData.
Block-based content classification with customizable classifiers for filtering navigation, footers, and promotional content.
Embeddings pipeline now prefers normalizedText when available for higher quality vector representations.

New Features

Content Normalization (`normalize` option)

import { scrape } from 'scrapex';

const url = 'https://example.com/article'; // placeholder URL

const result = await scrape(url, {
  normalize: {
    mode: 'full',            // or 'summary' for score-ranked blocks
    removeBoilerplate: true, // filter nav, footer, promos
    maxChars: 5000,          // truncate at sentence boundary
  },
});

console.log(result.normalizedText);    // Clean text ready for embedding
console.log(result.normalizationMeta); // Char count, token estimate, hash
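To illustrate what "truncate at sentence boundary" could mean for `maxChars`, here is a minimal sketch. The helper `truncateAtSentence` is hypothetical and is not scrapex's implementation; it assumes a sentence ends at `.`, `!`, or `?` followed by whitespace or the end of the text.

```typescript
// Hypothetical helper for illustration only; not part of the scrapex API.
interface TruncateResult {
  text: string;
  truncated: boolean;
}

function truncateAtSentence(text: string, maxChars: number): TruncateResult {
  if (text.length <= maxChars) {
    return { text, truncated: false };
  }

  // Scan backwards from the character limit for the last sentence terminator
  // that is followed by whitespace (or the end of the text).
  let cut = -1;
  for (let i = maxChars - 1; i >= 0; i--) {
    const isTerminator = '.!?'.includes(text.charAt(i));
    const atBoundary = i + 1 >= text.length || /\s/.test(text.charAt(i + 1));
    if (isTerminator && atBoundary) {
      cut = i + 1;
      break;
    }
  }

  // Fall back to a hard cut when no sentence boundary fits within the limit.
  const out = cut > 0 ? text.slice(0, cut) : text.slice(0, maxChars);
  return { text: out.trimEnd(), truncated: true };
}

const sample = truncateAtSentence('One. Two. Three.', 12);
console.log(sample.text, sample.truncated);
```

The `truncated` flag here plays the same role as the `truncated` field listed for NormalizationMeta: it lets a caller tell whether the limit actually cut anything.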

Standalone Normalization

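import { load } from 'cheerio'; // assumed: load() here is cheerio's HTML loader
import { parseBlocks, normalizeText, combineClassifiers, defaultBlockClassifier } from 'scrapex';

const html = '<article><p>Hello world.</p></article>'; // placeholder HTML

const $ = load(html);
const blocks = parseBlocks($);
const result = await normalizeText(blocks, { mode: 'full' });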

Custom Classifiers

const myClassifier = combineClassifiers(
  defaultBlockClassifier,
  (block) => {
    if (block.text.includes('About the Author')) {
      return { accept: false, label: 'author-bio' };
    }
    return { accept: true };
  }
);

const result = await scrape(url, {
  normalize: { blockClassifier: myClassifier },
});
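The release notes do not spell out how combined classifiers interact, so the following sketch shows one plausible composition rule: classifiers run in order and the first rejection wins. All names (`SketchBlock`, `SketchResult`, `combineSketch`) are hypothetical stand-ins, and the sketch handles only synchronous classifiers, whereas ContentBlockClassifier may also be async.

```typescript
// Hypothetical stand-ins for ContentBlock and ClassifierResult; the real
// scrapex types may have more fields.
interface SketchBlock {
  type: string;
  text: string;
}

interface SketchResult {
  accept: boolean;
  score?: number;
  label?: string;
}

type SketchClassifier = (block: SketchBlock) => SketchResult;

// Assumed semantics: run classifiers in order, stop at the first rejection.
function combineSketch(...classifiers: SketchClassifier[]): SketchClassifier {
  return (block) => {
    let last: SketchResult = { accept: true };
    for (const classify of classifiers) {
      last = classify(block);
      if (!last.accept) {
        return last; // first rejection short-circuits the chain
      }
    }
    return last; // result of the final classifier, or accept if none given
  };
}

const rejectNav: SketchClassifier = (block) =>
  block.type === 'nav' ? { accept: false, label: 'nav' } : { accept: true };

const rejectAuthorBio: SketchClassifier = (block) =>
  block.text.includes('About the Author')
    ? { accept: false, label: 'author-bio' }
    : { accept: true };

const combined = combineSketch(rejectNav, rejectAuthorBio);
console.log(combined({ type: 'paragraph', text: 'About the Author: Jane' }));
```

Under this rule, ordering matters only for which label a rejected block receives; a block is accepted only if every classifier accepts it.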

New Types

ContentBlock - Classified content block with type, text, and context
BlockType - paragraph, heading, list, quote, table, code, media, nav, footer, promo, legal
ClassifierResult - accept/reject with optional score and label
ContentBlockClassifier - sync or async classifier function
NormalizeOptions - full configuration for normalization
NormalizationMeta - charCount, tokenEstimate, hash, blocksTotal/Accepted, truncated

Security

ReDoS prevention via input limiting: regex matching runs on a 1000-character slice of the input
maxBlocks enforcement (default 2000) prevents unbounded DOM traversal
HTML content is included only when includeHtml: true is explicitly set
SHA-256 content hashing, truncated to 32 hex characters (128 bits), for deduplication
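Two of the measures above can be sketched with Node's built-in `crypto` module. This is an illustration under stated assumptions, not scrapex's code: the function names, the promo pattern, and the choice to slice from the start of the input are all hypothetical.

```typescript
import { createHash } from 'node:crypto';

// SHA-256 digest truncated to 32 hex characters, i.e. 128 bits.
function contentHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex').slice(0, 32);
}

// Assumed limit, matching the 1000-character slice mentioned above.
const REGEX_INPUT_LIMIT = 1000;

// Hypothetical pattern check: the regex only ever sees a bounded prefix,
// so its running time cannot grow with the size of the page.
function looksLikePromo(text: string): boolean {
  const bounded = text.slice(0, REGEX_INPUT_LIMIT);
  return /\b(subscribe|sign up|newsletter)\b/i.test(bounded);
}

console.log(contentHash('abc')); // ba7816bf8f01cfea414140de5dae2223
console.log(looksLikePromo('Subscribe to our newsletter')); // true
```

Bounding the input trades some recall for predictability: a match that starts beyond the limit is missed, which is the accepted cost of capping regex work.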

Documentation

New normalize-text design document with full API reference
Updated scrape, embeddings, and types documentation
New example: examples/22-normalize-text.ts

Installation

npm install scrapex@beta

Notes

Requires Node.js 20+.
normalizedText is optional and does not affect existing content/textContent fields.

Full Changelog: v1.0.0-beta.3...v1.0.0-beta.4
