v0.5.0
Get millions of labeled images in just a few hours*
This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.
*Testing involved crawling 1 million images, each with at least a caption or description, which took 1 hour and 20 minutes. This was done on a machine with 10 Gbit/s bandwidth, using the CC-NEWS crawl running with 50 processes. Results may vary based on the use case and bandwidth.
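To put the test run in the footnote above into perspective, the reported figures can be converted into a rough throughput. This is only back-of-the-envelope arithmetic on the stated numbers (1 million images, 1 hour 20 minutes, 50 processes); the variable names are illustrative and it assumes the load was spread evenly across processes.

```python
# Figures taken from the test run described in the footnote.
TOTAL_IMAGES = 1_000_000
DURATION_SECONDS = (1 * 60 + 20) * 60  # 1 hour 20 minutes = 4800 seconds
PROCESSES = 50

# Overall rate, and the rate per process assuming an even split.
images_per_second = TOTAL_IMAGES / DURATION_SECONDS
images_per_process_per_second = images_per_second / PROCESSES

print(f"{images_per_second:.0f} images/s overall")
print(f"{images_per_process_per_second:.1f} images/s per process")
```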
Image Extraction
Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. For each crawled article, Fundus automatically parses image links and metadata, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).

Figure: Language distribution of one million crawled images, excluding languages with fewer than 1000 entries.
See the list of supported publishers to find out which ones support image extraction.
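The image metadata described in this section (caption, description, author, and several versions sorted by size) can be pictured with a small data model. The sketch below is not the Fundus API: `VersionRecord`, `ImageRecord`, `is_labeled`, and `largest` are hypothetical stand-ins, the URLs are placeholders, and the smallest-first ordering is an assumption. It only illustrates how such records might be filtered for labeled images.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class VersionRecord:
    """One rendition of an image (hypothetical stand-in, not a Fundus class)."""
    url: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


@dataclass
class ImageRecord:
    """Holds the metadata named above: caption, description, authors, versions."""
    versions: List[VersionRecord]
    caption: Optional[str] = None
    description: Optional[str] = None
    authors: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep versions sorted by size, smallest first (direction is an assumption).
        self.versions = sorted(self.versions, key=lambda v: v.area)

    @property
    def is_labeled(self) -> bool:
        # "Labeled" here means: has at least a caption or a description.
        return bool(self.caption or self.description)

    @property
    def largest(self) -> VersionRecord:
        # Raises IndexError if an image has no versions at all.
        return self.versions[-1]


images = [
    ImageRecord(
        versions=[
            VersionRecord("https://example.com/a_1200.jpg", 1200, 800),
            VersionRecord("https://example.com/a_300.jpg", 300, 200),
        ],
        caption="Placeholder caption",
        authors=["Placeholder Author"],
    ),
    ImageRecord(versions=[VersionRecord("https://example.com/b.jpg", 640, 480)]),
]

# Keep only images that carry a caption or description.
labeled = [img for img in images if img.is_labeled]
for img in labeled:
    print(img.caption, img.largest.url)
```

Sorting once at construction time keeps lookups such as `largest` trivial, at the cost of ignoring versions appended later.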
New Publishers for it, ch, jp, es, dk, tz, be
With this major release, Fundus now offers support for 124 publishers from 22 different countries.
IT
- Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
- Add CorriereDellaSera by @addie9800 in #677
- Support for 2 new Italian newspapers: Corriere della Sera & Il Giornale by @ruggsea in #700
CH
JP
- Add Taipei Times by @MaxDall in #674
- Add AsahiShimbun by @MaxDall in #682
- Add ChunichiShimbun and TokyoShimbun by @MaxDall in #683
- Add MainichiShimbun by @MaxDall in #685
- Add Nikkei by @MaxDall in #686
- Add SankeiShimbun by @MaxDall in #688
- Add NikkanGeadai by @MaxDall in #689
ES
- Add El Mundo by @MaxDall in #675
- Add ABC by @addie9800 in #681
- Add LaVanguardia by @addie9800 in #684
DK
- Add DK by @addie9800 in #696
TZ
- Add Tanzanian Publishers by @addie9800 in #691
BE
- Add BE by @addie9800 in #697
Update Publishers
- Update FreiePresse by @addie9800 in #663
- Fix Metro by @addie9800 in #665
- Update BoersenZeitung parser by @MaxDall in #666
- Update BBC by @addie9800 in #668
- Layout Change SRF by @addie9800 in #680
- Add parser v1_1-iNews by @addie9800 in #693
- Update Dagbladet by @addie9800 in #695
Bug fixes
- Reraise exceptions in main thread when error handling is set to raise by @MaxDall in #662
- Fix a bug returning None for empty values in xpath_search by @MaxDall in #671
- Add IST to tzinfo by @MaxDall in #690
- Fix article serialization for images by @MaxDall in #703
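The first fix in the list above concerns re-raising worker exceptions in the main thread. As a general illustration of that pattern (not the Fundus implementation), the sketch below collects exceptions from worker threads in a queue and re-raises the first one in the calling thread. `run_workers`, `guarded`, and the `error_handling` values `"raise"` and `"suppress"` are hypothetical names chosen for this example.

```python
import queue
import threading
from typing import Callable, List


def run_workers(
    tasks: List[Callable[[], None]], error_handling: str = "raise"
) -> List[BaseException]:
    """Run each task in its own thread and surface worker errors to the caller.

    `error_handling` is a hypothetical switch: "raise" re-raises the first
    collected worker exception in the calling thread, "suppress" only returns them.
    """
    errors: queue.Queue = queue.Queue()

    def guarded(task: Callable[[], None]) -> None:
        # Exceptions raised inside a thread do not propagate to the caller,
        # so capture them and hand them over through a thread-safe queue.
        try:
            task()
        except Exception as exc:
            errors.put(exc)

    threads = [threading.Thread(target=guarded, args=(task,)) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected: List[BaseException] = []
    while not errors.empty():
        collected.append(errors.get())

    if collected and error_handling == "raise":
        # Re-raise in the calling (main) thread so the failure is not silent.
        raise collected[0]
    return collected


def ok() -> None:
    pass


def broken() -> None:
    raise ValueError("parser failed")


try:
    run_workers([ok, broken])
    reraised = False
except ValueError:
    reraised = True

print("exception reached the main thread:", reraised)
```

Without such a hand-over, an error in a worker thread would only show up as a traceback on stderr while the main thread carried on.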
Improvements
New Contributors
Full Changelog: v0.4.6...v0.5.0