v0.5.1
🌍 Support for 150 Publishers & New Language-Based Search and Corpus Controls 🚀
With this release, Fundus now supports 150 publishers across 30 countries, thanks to the addition of 14 new regions and 24 new publishers!
✨ New Features
As our coverage grows, so does the need for better language and data management, so this release introduces two new features:
🔎 Language-Based Publisher Search
You can now filter publishers based on the languages they support. This makes it easier to target specific linguistic corpora or build multilingual datasets.
from fundus import Crawler, PublisherCollection
# Find publishers that support Japanese
filtered_publishers = PublisherCollection.search(languages=["ja"])
# US-based publishers that also offer Spanish content
filtered_publishers = PublisherCollection.us.search(languages=["es"])
crawler = Crawler(*filtered_publishers)
for article in crawler.crawl():
    print(article)
- Add search by language functionality by @addie9800 in #667
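To make the filtering idea concrete without depending on Fundus itself, here is a minimal, self-contained sketch. The `Publisher` dataclass, the `search_by_language` helper, and the example registry are all hypothetical and are not part of the Fundus API. The sketch also assumes a publisher matches when it supports at least one of the requested languages; these notes do not say whether Fundus applies "any" or "all" semantics.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Publisher:
    """Hypothetical stand-in for a publisher entry (not a Fundus class)."""
    name: str
    country: str
    languages: FrozenSet[str]


def search_by_language(publishers: Iterable[Publisher], languages: Iterable[str]) -> List[Publisher]:
    """Keep publishers that support at least one requested language (assumed semantics)."""
    wanted = set(languages)
    return [p for p in publishers if p.languages & wanted]


# Made-up registry used only for illustration.
EXAMPLE_PUBLISHERS = [
    Publisher("ExampleDaily", "us", frozenset({"en", "es"})),
    Publisher("ExampleTimes", "us", frozenset({"en"})),
    Publisher("ExampleShimbun", "jp", frozenset({"ja"})),
]

spanish = search_by_language(EXAMPLE_PUBLISHERS, ["es"])
japanese = search_by_language(EXAMPLE_PUBLISHERS, ["ja"])
```

In real code, prefer `PublisherCollection.search(languages=[...])` as shown above; the sketch only illustrates the kind of set intersection such a filter might perform.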
🧮 Balanced Article Crawling
You can now cap the number of articles crawled per publisher with the new max_articles_per_publisher parameter of crawl, which is useful for building balanced datasets.
from fundus import Crawler, PublisherCollection
crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl(max_articles_per_publisher=10, save_to_file="my_corpus.json"):
    print(article)
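The effect of a per-publisher cap can be pictured with a small, library-independent sketch. The `cap_per_publisher` generator and the `(publisher, article)` tuple stream below are hypothetical and do not reflect how Fundus implements the parameter internally. A real crawler can presumably stop fetching from a publisher once its cap is reached, whereas this sketch only filters an already-produced stream.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def cap_per_publisher(
    articles: Iterable[Tuple[str, T]], max_per_publisher: int
) -> Iterator[Tuple[str, T]]:
    """Yield (publisher, article) pairs, skipping any beyond the per-publisher cap."""
    counts: Counter = Counter()
    for publisher, article in articles:
        if counts[publisher] >= max_per_publisher:
            continue  # this publisher already reached its quota
        counts[publisher] += 1
        yield publisher, article


# Made-up interleaved stream: publisher name plus an article id.
stream = [("A", 1), ("A", 2), ("B", 1), ("A", 3), ("B", 2), ("B", 3)]
balanced = list(cap_per_publisher(stream, max_per_publisher=2))
```

With the cap set to 2, each publisher contributes at most two articles regardless of how many it emits, which is what keeps the resulting corpus balanced.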
Check out our documentation for more details!
Publishers
This update brings 14 new regions and 24 additional publishers, pushing our total to 150 supported publishers!
Added Regions
- Add PL by @addie9800 in #698
- Add PT by @addie9800 in #699
- Add CZ by @horychtom in #725
- Add MX + minor bug fixes by @addie9800 in #734
- Add GL by @addie9800 in #735
- Add ISL by @addie9800 in #736
- Add IL by @addie9800 in #737
- Add PY by @addie9800 in #741
- Add RU by @addie9800 in #757
- Add KR by @addie9800 in #758
- Add KR with MBN by @zxxxv in #765
- Add ZA by @addie9800 in #760
- Add LS by @addie9800 in #762
- Add LU by @addie9800 in #775
- Add LI by @addie9800 in #777
Added Publishers
- Added Turkish publisher Anadolu Ajansı by @MSDuran in #722
- Add Tageszeitung by @addie9800 in #738
- Add MallorcaMagazin by @addie9800 in #739
- Add MallorcaZeitung by @addie9800 in #740
- Add DailyMaverick by @addie9800 in #761
- Add LuxemburgerWort by @addie9800 in #776
- Add Spanish publishers by @Finiluh in #768
- Add SalzburgerNachrichten by @addie9800 in #770
- Add DiePresse by @addie9800 in #771
- Add KleineZeitung by @addie9800 in #778
Updated Publishers
- add url_filter to voa by @addie9800 in #715
- add url_filter and RSSFeeds by @addie9800 in #716
- Update BusinessInsider by @addie9800 in #717
- Update author extraction for JyllandsPosten by @addie9800 in #718
- Fix Focus sources by @MaxDall in #732
- Update FAZ parser to version V3 by @MaxDall in #733
- Adjust ZDFParser to be more suitable for live tickers by @MaxDall in #747
- Modify BBC selectors by @addie9800 in #749
- Update RSSFeed for Bild by @addie9800 in #752
- Add V1_1 for NationalPost by @addie9800 in #754
- Update Sitemaps for Tanzania by @addie9800 in #755
- Update _paragraph_selector for JyllandsPosten by @addie9800 in #756
- Update Tanzanian publishers by @addie9800 in #766
- Change name of MBN by @addie9800 in #769
- Add V1_1 for NDR by @addie9800 in #773
- Update Nieuwsblad by @addie9800 in #780
Deprecated Publishers
- Deprecate TheTelegraph by @MaxDall in #711
- Deprecate Nikkei by @addie9800 in #767
- Deprecate TheNamibian by @addie9800 in #779
Bug Fixes & Stability
- Set allow_all=True when robots cannot be loaded by @MaxDall in #709
- Add max_articles_per_publisher parameter to crawl by @MaxDall in #710
- Extend timeout in publisher coverage by @addie9800 in #712
- Properly release resources by @MaxDall in #713
- Docs: Fix date_filter example by @dallasbrittany in #714
- Bug Fixes - Events by @addie9800 in #719
- Register default stop event for WebSource by @MaxDall in #721
- Make network connections interruptible by @MaxDall in #723
- Rework language attribution by @MaxDall in #726
- Make lang attribute deterministic by @MaxDall in #742
- Bug Fix in Source Restriction by @addie9800 in #746
- Bug Fixes from Publisher Coverage by @addie9800 in #753
- Add logging for source restriction by @addie9800 in #774
- Remove duplicate entries in PublisherCollection after merge of #757 by @MaxDall in #781
- Remove Whitespace Normalization in image source parsing by @addie9800 in #692
Cleanup & Maintenance
- Remove leftover ANADOLUAJANSI.json by @MaxDall in #727
- Remove unused imports by @MaxDall in #729
- Update publisher_coverage.yaml by @addie9800 in #750
- Update publisher_coverage.yaml by @addie9800 in #751
Testing
New Contributors
- @dallasbrittany made their first contribution in #714
- @MSDuran made their first contribution in #722
- @horychtom made their first contribution in #725
- @zxxxv made their first contribution in #765
- @Finiluh made their first contribution in #768
Full Changelog: v0.5.0...v0.5.1