v1.0.0-beta.4

Pre-release

@developer-rakeshpaul developer-rakeshpaul released this 02 Jan 15:51
· 8 commits to main since this release
84a673b

scrapex v1.0.0-beta.4

First-class content normalization for embedding-ready and LLM-optimized text output.

Highlights

  • New normalize option in scrape() produces clean, boilerplate-free text directly in ScrapedData.
  • Block-based content classification with customizable classifiers for filtering navigation, footers, and promotional content.
  • Embeddings pipeline now prefers normalizedText when available for higher quality vector representations.

New Features

Content Normalization (normalize option)

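import { scrape } from 'scrapex';

// `url` is the address of the page to scrape, defined elsewhere.
const result = await scrape(url, {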
  normalize: {
    mode: 'full',           // or 'summary' for score-ranked blocks
    removeBoilerplate: true, // filter nav, footer, promos
    maxChars: 5000,          // truncate at sentence boundary
  },
});

console.log(result.normalizedText);       // Clean text ready for embedding
console.log(result.normalizationMeta);    // Char count, token estimate, hash
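
The maxChars comment above mentions truncating at a sentence boundary. As a rough illustration of that idea (not scrapex's actual algorithm), the sketch below cuts text at the last sentence end that fits within the limit and falls back to a hard cut when no boundary is found. The helper name truncateAtSentence and its return shape are hypothetical.

```typescript
// Hypothetical helper, not part of scrapex: keep whole sentences up to maxChars.
function truncateAtSentence(
  text: string,
  maxChars: number
): { text: string; truncated: boolean } {
  if (text.length <= maxChars) {
    return { text, truncated: false };
  }

  // A sentence end is '.', '!' or '?' followed by whitespace or end of input.
  const sentenceEnd = /[.!?](?=\s|$)/g;
  let cut = -1;
  let match: RegExpExecArray | null;
  while ((match = sentenceEnd.exec(text)) !== null) {
    const end = match.index + 1;
    if (end > maxChars) break; // this sentence no longer fits
    cut = end;
  }

  // Fall back to a hard cut when no sentence boundary fits in the limit.
  const kept = cut > 0 ? text.slice(0, cut) : text.slice(0, maxChars);
  return { text: kept.trimEnd(), truncated: true };
}

const sample = truncateAtSentence('First one. Second one. Third one.', 25);
console.log(sample.text, sample.truncated);
```

A real implementation would likely need to handle abbreviations and decimal numbers more carefully; this sketch only shows the basic shape.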

Standalone Normalization

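import { load } from 'cheerio'; // assumed: `load` is cheerio's HTML loader
import { parseBlocks, normalizeText, combineClassifiers, defaultBlockClassifier } from 'scrapex';

// `html` is a raw HTML string obtained elsewhere.
const $ = load(html);
const blocks = parseBlocks($);
const result = await normalizeText(blocks, { mode: 'full' });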

Custom Classifiers

const myClassifier = combineClassifiers(
  defaultBlockClassifier,
  (block) => {
    if (block.text.includes('About the Author')) {
      return { accept: false, label: 'author-bio' };
    }
    return { accept: true };
  }
);

const result = await scrape(url, {
  normalize: { blockClassifier: myClassifier },
});
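
To make the idea of chaining classifiers more concrete, here is a minimal, self-contained sketch of one way such a combinator could behave. It assumes that classifiers run in order and that the first rejection wins; the release notes do not state how combineClassifiers resolves conflicts, so treat that as an assumption. All names below (SketchBlock, SketchResult, combineSketch) are local stand-ins, and the sketch is synchronous only for brevity.

```typescript
// Local stand-ins for illustration; the real scrapex types may differ.
interface SketchBlock {
  type: string;
  text: string;
}

interface SketchResult {
  accept: boolean;
  score?: number;
  label?: string;
}

type SketchClassifier = (block: SketchBlock) => SketchResult;

// Assumed semantics: run classifiers in order; the first rejection wins.
function combineSketch(...classifiers: SketchClassifier[]): SketchClassifier {
  return (block) => {
    for (const classify of classifiers) {
      const result = classify(block);
      if (!result.accept) return result;
    }
    return { accept: true };
  };
}

const rejectNav: SketchClassifier = (block) =>
  block.type === 'nav' ? { accept: false, label: 'nav' } : { accept: true };

const rejectAuthorBio: SketchClassifier = (block) =>
  block.text.includes('About the Author')
    ? { accept: false, label: 'author-bio' }
    : { accept: true };

const combined = combineSketch(rejectNav, rejectAuthorBio);
console.log(combined({ type: 'paragraph', text: 'About the Author: Jane' }));
```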

New Types

  • ContentBlock - Classified content block with type, text, and context
  • BlockType - paragraph, heading, list, quote, table, code, media, nav, footer, promo, legal
  • ClassifierResult - accept/reject with optional score and label
  • ContentBlockClassifier - sync or async classifier function
  • NormalizeOptions - full configuration for normalization
  • NormalizationMeta - charCount, tokenEstimate, hash, blocksTotal/Accepted, truncated

Security

  • ReDoS prevention via input limiting (1000 char slice for regex matching)
  • maxBlocks enforcement (default 2000) prevents unbounded DOM traversal
  • HTML content only included when includeHtml: true is explicitly set
  • SHA-256 content hashing for deduplication, truncated to 128 bits (32 hex chars)
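
Two of the measures above can be sketched with Node's built-in APIs. This is an illustration under stated assumptions, not scrapex's implementation: the function names are hypothetical, the regex is an arbitrary example, and the 1000-character slice is assumed to be taken from the start of the text.

```typescript
import { createHash } from 'node:crypto';

// A SHA-256 hex digest is 64 chars; the first 32 hex chars carry 128 bits.
function contentHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex').slice(0, 32);
}

// Cap the input handed to a regex so matching cost stays bounded.
const REGEX_INPUT_LIMIT = 1000;

// Hypothetical check: only the first 1000 characters are ever scanned.
function looksLikeCopyright(text: string): boolean {
  const limited = text.slice(0, REGEX_INPUT_LIMIT);
  return /(©|\bcopyright\b|all rights reserved)/i.test(limited);
}

console.log(contentHash('abc'));
console.log(looksLikeCopyright('Copyright 2024 Example Corp.'));
```

Identical input always yields the same 32-character hash, which is what makes it usable as a deduplication key.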

Documentation

  • New normalize-text design document with full API reference
  • Updated scrape, embeddings, and types documentation
  • New example: examples/22-normalize-text.ts

Installation

npm install scrapex@beta

Notes

  • Requires Node.js 20+.
  • normalizedText is optional and does not affect existing content/textContent fields.
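
Because normalizedText is optional, code that consumes scrape results may want a fallback. The sketch below shows one possible selection order, preferring normalizedText as the embeddings pipeline is described as doing. The fallback order after normalizedText, the interface, and the function name are assumptions made for illustration.

```typescript
// Local stand-in for the relevant result fields; optionality is assumed.
interface ScrapedTextFields {
  normalizedText?: string;
  textContent?: string;
  content?: string;
}

// Hypothetical helper: return the first non-empty candidate, preferring normalizedText.
function pickEmbeddingText(data: ScrapedTextFields): string {
  const candidates = [data.normalizedText, data.textContent, data.content];
  for (const candidate of candidates) {
    if (candidate !== undefined && candidate.trim().length > 0) {
      return candidate;
    }
  }
  return '';
}

console.log(pickEmbeddingText({ normalizedText: 'clean text', textContent: 'raw text' }));
```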

Full Changelog: v1.0.0-beta.3...v1.0.0-beta.4