v0.4.3
Introducing New Publishers from Canada, Germany, and India
This release includes:
- Support for five new publishers (three from Canada, one from India, and one from Germany)
- Article filtering based on robots.txt
New Features
With this update, we've implemented article filtering using robots.txt. Each URL fetched is now evaluated against the path and user-agent restrictions specified by publishers in their robots.txt files. This feature is enabled by default, but users can disable it by setting ignore_robots=True in the Crawler constructor.
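To illustrate the kind of check described above, here is a minimal sketch that uses Python's standard-library `urllib.robotparser`. It is not the library's own implementation: the `filter_urls` helper, the sample robots.txt content, the bot names, and the example.com URLs are all assumptions made up for this example. Only the `ignore_robots` flag name is borrowed from the release note.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_urls(urls, parser, user_agent, ignore_robots=False):
    """Keep only URLs the given user agent may fetch.

    With ignore_robots=True the check is skipped entirely
    (hypothetical helper; it only mirrors the flag name).
    """
    if ignore_robots:
        return list(urls)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.com/news/article-1",
    "https://example.com/private/draft",
]

# An agent without its own rules falls back to the "*" entry.
allowed_default = filter_urls(urls, parser, "OtherBot")
# ExampleBot is disallowed everywhere by its dedicated entry.
allowed_blocked = filter_urls(urls, parser, "ExampleBot")
# Skipping the check returns every URL unchanged.
allowed_ignored = filter_urls(urls, parser, "ExampleBot", ignore_robots=True)
```

The sketch shows both dimensions mentioned above: the path restriction (`/private/`) and the user-agent restriction (`ExampleBot`). Passing `ignore_robots=True` bypasses both, which is the behaviour the constructor flag is described as providing.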
New Publishers
Canada (CA)
- Introduced CBC as the first Canadian publisher by @addie9800 in #583
- Added National Post by @addie9800 in #584
- Included The Globe and Mail by @addie9800 in #587
India (IND)
- Added Times Of India by @addie9800 in #569
Germany (DE)
Updates
We've updated our APNews parser so that it parses authors correctly again, and applied additional fixes.
Bug Fixes
- Protected key access for RSSFeed entries by @MaxDall in #599
- Fixed an issue in test file generation by @addie9800 in #597
Full Changelog: v0.4.2...v0.4.3