v1.0.0-beta.5

Pre-release

@developer-rakeshpaul developer-rakeshpaul released this 03 Jan 04:26
· 5 commits to main since this release
822d090

scrapex v1.0.0-beta.5

Feed item normalization and enhanced Media RSS support for podcast and media-rich feeds.

Highlights

  • New normalizeFeedItem() helper to convert RSS/Atom item content into clean, embedding-ready text
  • Support for selector@attr syntax in customFields to extract XML attribute values (e.g., media:thumbnail@url)
  • Enhanced Media RSS namespace handling for podcast and media feeds

New Features

Feed Item Normalization

import { RSSParser, normalizeFeedItem } from 'scrapex';

const parser = new RSSParser();
// `xml` holds the raw feed XML as a string (fetched or read elsewhere)
const feed = parser.parse(xml, 'https://example.com/feed.xml');

for (const item of feed.data.items) {
  const normalized = await normalizeFeedItem(item, {
    mode: 'full',
    removeBoilerplate: true,
  });

  console.log(normalized.text);       // Clean text from item.content or item.description
  console.log(normalized.meta);       // charCount, tokenEstimate, hash, etc.
}

Attribute Extraction with selector@attr

Extract XML attribute values from namespaced elements:

const parser = new RSSParser({
  customFields: {
    // Extract url attribute from media:thumbnail element
    thumbnail: 'media\\:thumbnail@url',
    // Extract url attribute from media:content element
    contentUrl: 'media\\:content@url',
  },
});

// `mediaRssFeed` holds the raw Media RSS XML as a string
const result = parser.parse(mediaRssFeed);
console.log(result.data.items[0]?.customFields?.thumbnail);
// => "https://example.com/images/thumbnail.jpg"
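To illustrate the shape of a selector@attr expression, here is a minimal sketch of how such a string could be split into its selector and attribute parts. The `splitSelectorAttr` helper and the `SelectorAttr` type are hypothetical names chosen for this example, and the validation rule is an assumption; this is not scrapex's actual implementation.

```typescript
// Hypothetical helper for illustration only; not part of the scrapex API.
interface SelectorAttr {
  selector: string;
  attr?: string;
}

function splitSelectorAttr(field: string): SelectorAttr {
  // Split on the last "@" so the selector part is kept intact
  const at = field.lastIndexOf('@');
  if (at === -1) {
    // No attribute requested: the whole string is the selector
    return { selector: field };
  }

  const selector = field.slice(0, at);
  const attr = field.slice(at + 1);

  // Assumed validation: non-empty selector and a plausible XML attribute name
  if (selector.length === 0 || !/^[A-Za-z_][\w.:-]*$/.test(attr)) {
    throw new Error(`Invalid selector@attr expression: ${field}`);
  }
  return { selector, attr };
}

const parsed = splitSelectorAttr('media\\:thumbnail@url');
console.log(parsed.selector, parsed.attr);
```

Splitting on the last "@" is a design choice of this sketch; a selector that itself contains "@" inside a quoted attribute value would need more careful handling.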

Improvements

  • normalizeFeedItem() falls back to plain-text extraction when HTML parsing yields no blocks
  • Improved handling of content:encoded CDATA sections
  • Better attribute validation in selector@attr parsing
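The plain-text fallback described in the first item above can be pictured roughly as follows. This is a simplified sketch under stated assumptions: `extractBlocks`, `stripTags`, and `toText` are hypothetical names, and the regex-based paragraph matching is only a stand-in for real HTML parsing, not how scrapex does it.

```typescript
// Hypothetical sketch; none of these functions come from scrapex.

// Remove tags and collapse whitespace to get plain text
function stripTags(s: string): string {
  return s.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Stand-in for block parsing: collect the text of each <p> element
function extractBlocks(html: string): string[] {
  const blocks: string[] = [];
  const re = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    const text = stripTags(m[1] ?? '');
    if (text) blocks.push(text);
  }
  return blocks;
}

// Use parsed blocks when there are any; otherwise fall back to plain text
function toText(content: string): string {
  const blocks = extractBlocks(content);
  return blocks.length > 0 ? blocks.join('\n\n') : stripTags(content);
}

console.log(toText('<p>Hello <b>world</b></p><p>Bye</p>'));
console.log(toText('Just a plain description'));
```

The point of the fallback is that items whose description is already plain text, or whose markup contains no recognizable blocks, still yield usable text instead of an empty result.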

Test Coverage

  • New rss2-media.xml fixture covering Media RSS namespace scenarios
  • E2E tests for normalizeFeedItem() with various content types
  • E2E tests for selector@attr custom field extraction

Documentation

  • Updated RSS parsing guide with normalization examples
  • Updated API docs for RSSParser and normalizeFeedItem()
  • New example in examples/20-rss-parsing.ts

Installation

npm install scrapex@beta

Notes

  • Requires Node.js 20+
  • normalizeFeedItem() uses the same normalization pipeline as scrape({ normalize: ... })

Full Changelog: v1.0.0-beta.4...v1.0.0-beta.5