
v0.5.1


@addie9800 released this 22 Jul 21:02
5e4761f

🌍 Support for 150 Publishers & New Language-Based Search and Corpus Controls 🚀

With this release, Fundus now supports 150 publishers across 30 countries, thanks to the addition of 14 new regions and 24 new publishers!

✨ New Features

As our coverage grows, so does the need for better language and data management—so we’ve introduced two powerful new features:

🔎 Language-Based Publisher Search

You can now filter publishers based on the languages they support. This makes it easier to target specific linguistic corpora or build multilingual datasets.

from fundus import Crawler, PublisherCollection

# Find publishers that support Japanese
japanese_publishers = PublisherCollection.search(languages=["ja"])

# US-based publishers that also offer Spanish content
us_spanish_publishers = PublisherCollection.us.search(languages=["es"])

# Crawl only the US-based publishers that offer Spanish content
crawler = Crawler(*us_spanish_publishers)
for article in crawler.crawl():
    print(article)
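
To make the filtering idea concrete without depending on Fundus internals, here is a minimal, library-free sketch. Everything in it is hypothetical: the `PublisherInfo` record, the `search_by_language` helper, and the sample catalog are illustrations, not Fundus classes or data. It also assumes that a publisher matches when it supports at least one of the requested languages, which may differ from how `PublisherCollection.search` behaves.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class PublisherInfo:
    """Hypothetical stand-in for a publisher entry (not a Fundus class)."""
    name: str
    country: str
    languages: FrozenSet[str]


def search_by_language(
    publishers: Iterable[PublisherInfo], languages: Iterable[str]
) -> List[PublisherInfo]:
    """Keep publishers that support at least one requested language code.

    Assumption: "any of" semantics; the real library may match differently.
    """
    wanted = set(languages)
    return [p for p in publishers if wanted & p.languages]


# Made-up sample catalog, for illustration only.
CATALOG = [
    PublisherInfo("Example Daily", "us", frozenset({"en", "es"})),
    PublisherInfo("Sample Shimbun", "jp", frozenset({"ja"})),
    PublisherInfo("Demo Times", "us", frozenset({"en"})),
]

# Search the whole catalog for Japanese-language publishers.
japanese = search_by_language(CATALOG, ["ja"])

# Narrow to one country first, then search by language.
us_only = [p for p in CATALOG if p.country == "us"]
us_spanish = search_by_language(us_only, ["es"])

print([p.name for p in japanese])
print([p.name for p in us_spanish])
```

The sketch mirrors the two calls above: a search over the whole collection, and a search over a country subset.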

🧮 Balanced Article Crawling

You can now cap the number of articles per publisher during crawling using the new max_articles_per_publisher parameter—ideal for creating balanced datasets.

from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles_per_publisher=10, save_to_file="my_corpus.json"):
    print(article)

  • Add max_articles_per_publisher parameter to crawl by @MaxDall in #710
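
For readers curious what a per-publisher cap does to a stream of articles, the following is a small, self-contained sketch of the general idea. It is not how Fundus implements `max_articles_per_publisher`; the `cap_per_publisher` helper and the sample stream are hypothetical and exist only to show the balancing effect.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Article = Tuple[str, str]  # (publisher name, article title); illustrative shape only


def cap_per_publisher(articles: Iterable[Article], limit: int) -> Iterator[Article]:
    """Yield articles in arrival order, keeping at most `limit` per publisher."""
    seen: Counter = Counter()
    for publisher, title in articles:
        if seen[publisher] >= limit:
            continue  # this publisher has already reached its cap
        seen[publisher] += 1
        yield publisher, title


# Made-up stream in which publisher "a" dominates.
STREAM = [
    ("a", "a-1"), ("a", "a-2"), ("b", "b-1"),
    ("a", "a-3"), ("a", "a-4"), ("b", "b-2"), ("b", "b-3"),
]

balanced = list(cap_per_publisher(STREAM, limit=2))
print(balanced)
```

Because the cap is applied per publisher rather than to the total, a prolific source cannot crowd out the others in the resulting corpus.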

Check out our documentation for more details!

Publishers

This update brings 14 new regions and 24 additional publishers, pushing our total to 150 supported publishers!

Added Regions

Added Publishers

Updated Publishers

Deprecated Publishers

Bug Fixes & Stability

Cleanup & Maintenance

Testing

  • Add unit test if default_language is ISO 639 language code by @MaxDall in #744

New Contributors

Full Changelog: v0.5.0...v0.5.1