v1.0.0-beta.5
Pre-release
scrapex v1.0.0-beta.5
Feed item normalization and enhanced Media RSS support for podcast and media-rich feeds.
Highlights
- New normalizeFeedItem() helper to convert RSS/Atom item content into clean, embedding-ready text
- Support for selector@attr syntax in customFields to extract XML attribute values (e.g., media:thumbnail@url)
- Enhanced Media RSS namespace handling for podcast and media feeds
New Features
Feed Item Normalization
import { RSSParser, normalizeFeedItem } from 'scrapex';

const parser = new RSSParser();
const feed = parser.parse(xml, 'https://example.com/feed.xml');

for (const item of feed.data.items) {
  const normalized = await normalizeFeedItem(item, {
    mode: 'full',
    removeBoilerplate: true,
  });
  console.log(normalized.text); // Clean text from item.content or item.description
  console.log(normalized.meta); // charCount, tokenEstimate, hash, etc.
}

Attribute Extraction with selector@attr
Extract XML attribute values from namespaced elements:
const parser = new RSSParser({
  customFields: {
    // Extract url attribute from media:thumbnail element
    thumbnail: 'media\\:thumbnail@url',
    // Extract url attribute from media:content element
    contentUrl: 'media\\:content@url',
  },
});

const result = parser.parse(mediaRssFeed);
console.log(result.data.items[0]?.customFields?.thumbnail);
// => "https://example.com/images/thumbnail.jpg"

Improvements
- normalizeFeedItem() falls back to plain text extraction when HTML parsing yields no blocks
- Improved handling of content:encoded CDATA sections
- Better attribute validation in selector@attr parsing
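To make the selector@attr idea concrete, here is a minimal sketch of how such an expression could be split and validated. It is not scrapex's actual implementation: the parseAttrSelector function, the AttrSelector type, and the attribute-name pattern are all hypothetical names chosen for illustration.

```typescript
// Hypothetical helper, NOT part of the scrapex API: splits a
// "selector@attr" expression into its selector and attribute parts.
interface AttrSelector {
  selector: string;
  attr: string | null; // null means "read the element's text content"
}

// Assumed rule: an ASCII subset of XML attribute names.
const ATTR_NAME = /^[A-Za-z_:][A-Za-z0-9_:.-]*$/;

function parseAttrSelector(input: string): AttrSelector {
  // Split on the last "@" so the selector part may contain anything else.
  const at = input.lastIndexOf('@');
  if (at === -1) {
    return { selector: input.trim(), attr: null };
  }

  const selector = input.slice(0, at).trim();
  const attr = input.slice(at + 1).trim();

  // Reject an empty selector or a malformed attribute name early.
  if (selector === '' || !ATTR_NAME.test(attr)) {
    throw new Error(`Invalid selector@attr expression: "${input}"`);
  }
  return { selector, attr };
}

console.log(parseAttrSelector('media\\:thumbnail@url'));
console.log(parseAttrSelector('title'));
```

Splitting on the last "@" is a design choice of this sketch; the library may apply different rules.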
Test Coverage
- New rss2-media.xml fixture covering Media RSS namespace scenarios
- E2e tests for normalizeFeedItem() with various content types
- E2e tests for selector@attr custom field extraction
Documentation
- Updated RSS parsing guide with normalization examples
- Updated API docs for RSSParser and normalizeFeedItem()
- New example in examples/20-rss-parsing.ts
Installation
npm install scrapex@beta

Notes
- Requires Node.js 20+.
- normalizeFeedItem() uses the same normalization pipeline as scrape({ normalize: ... })
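The Feed Item Normalization example mentions meta fields such as charCount, tokenEstimate, and hash. As a rough illustration of what metadata like this might involve, the sketch below computes comparable values for a string. The names stripTags, describeText, and TextMeta are hypothetical, the 4-characters-per-token ratio is an assumed heuristic, and the FNV-1a hash is a stand-in; none of this is taken from the scrapex pipeline.

```typescript
// Hypothetical shape; the real fields of normalized.meta may differ.
interface TextMeta {
  charCount: number;
  tokenEstimate: number;
  hash: string;
}

// Crude plain-text fallback: drop tags, then collapse whitespace.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// 32-bit FNV-1a over UTF-16 code units, used here only as a simple
// stand-in for whatever hash the library actually computes.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, '0');
}

function describeText(text: string): TextMeta {
  return {
    charCount: text.length,
    // Assumed heuristic: roughly 4 characters per token.
    tokenEstimate: Math.ceil(text.length / 4),
    hash: fnv1a(text),
  };
}

const sampleText = stripTags('<p>Episode 12: <b>Feeds</b> in depth</p>');
console.log(sampleText, describeText(sampleText));
```

A stable hash of the normalized text can be useful for skipping items that have already been embedded, though whether scrapex uses it that way is not stated here.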
Full Changelog: v1.0.0-beta.4...v1.0.0-beta.5