This tutorial shows you how to filter articles based on their attribute values, URLs, and news sources.
A specific article may not contain all attributes the parser is capable of extracting.
By default, Fundus drops all articles without at least a title, body, and publishing date extracted to ensure data quality.
To alter this behavior, make use of the `only_complete` parameter of the `crawl()` method.
You have three options to do so:
- Use the built-in `ExtractionFilter` called `Requires`, or write a custom one.
- Set it to `False` to disable extraction filtering entirely.
- Set it to `True` to yield only fully extracted articles.
Let's print some articles with at least a `body`, `title`, and `topics` set.
```python
from fundus import Crawler, PublisherCollection, Requires

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles=2, only_complete=Requires("title", "body", "topics")):
    print(article)
```
NOTE: We recommend thinking about what kind of data is needed first and then running Fundus with a configured extraction filter afterward.
Writing custom extraction filters is supported and encouraged.
To do so, you need to define a `Callable` satisfying the `ExtractionFilter` protocol, which you can find here.
Let's build a custom extraction filter that rejects all articles whose topics do not include the string `usa`.
```python
from typing import Dict, Any

def topic_filter(extracted: Dict[str, Any]) -> bool:
    if topics := extracted.get("topics"):
        if "usa" in [topic.casefold() for topic in topics]:
            return False
    return True
```
and put it to work.
```python
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

for us_themed_article in crawler.crawl(only_complete=topic_filter):
    print(us_themed_article)
```
NOTE: Fundus' filters work inversely to Python's built-in `filter()`: a filter in Fundus describes what is filtered out, not what is kept. If a filter returns `True` for a specific element, the element will be dropped.
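For instance, here is a minimal sketch of that drop-on-`True` convention on a hand-made extraction dictionary (the `drop_untitled` helper is hypothetical, not part of Fundus):

```python
# Hypothetical filter illustrating the drop-on-True convention:
# returning True means the article is filtered OUT.
def drop_untitled(extracted):
    title = extracted.get("title") or ""
    return len(title) < 10  # drop articles with short or missing titles

assert drop_untitled({"title": "Short"}) is True  # dropped
assert drop_untitled({}) is True  # dropped
assert drop_untitled({"title": "A sufficiently long headline"}) is False  # kept
```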
```python
import datetime
from typing import Dict, Any

# only select articles from the past seven days
def date_filter(extracted: Dict[str, Any]) -> bool:
    end_date = datetime.date.today()
    start_date = end_date - datetime.timedelta(weeks=1)
    if publishing_date := extracted.get("publishing_date"):
        return not (start_date <= publishing_date.date() <= end_date)
    return True
```
```python
from typing import Dict, Any

# only select articles which include at least one string from
# ["pollution", "climate crisis"] in the article body
def body_filter(extracted: Dict[str, Any]) -> bool:
    if body := extracted.get("body"):
        for word in ["pollution", "climate crisis"]:
            if word in str(body).casefold():
                return False
    return True
```
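Since extraction filters are plain callables, they can be sanity-checked on hand-made extraction dictionaries without running a crawler. A minimal check of `body_filter` (repeated here so the snippet is self-contained):

```python
from typing import Dict, Any

# body_filter repeated from above so this snippet is self-contained
def body_filter(extracted: Dict[str, Any]) -> bool:
    if body := extracted.get("body"):
        for word in ["pollution", "climate crisis"]:
            if word in str(body).casefold():
                return False
    return True

assert body_filter({"body": "Air pollution is rising"}) is False  # kept
assert body_filter({"body": "Local sports roundup"}) is True  # dropped
assert body_filter({}) is True  # dropped: no body extracted
```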
Fundus allows you to filter articles by URL, both before and after the HTML is downloaded.
You do so by making use of the `url_filter` parameter of the `crawl()` method.
There is a built-in filter called `regex_filter` that filters out URLs based on regular expressions.
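As a rough sketch of what `regex_filter` presumably does (this mimics, rather than reuses, the real `fundus.scraping.filter.regex_filter`): return `True`, meaning drop the URL, whenever the pattern matches anywhere in the URL.

```python
import re

# Illustrative stand-in for regex_filter: drop a URL when the
# pattern matches anywhere in it.
def regex_filter_sketch(pattern: str):
    compiled = re.compile(pattern)

    def url_filter(url: str) -> bool:
        return bool(compiled.search(url))

    return url_filter

drop_noise = regex_filter_sketch("advertisement|podcast")
assert drop_noise("https://example.com/podcast/episode-1") is True  # dropped
assert drop_noise("https://example.com/news/story") is False  # kept
```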
Let's crawl a bunch of articles with URLs not including the word `advertisement` or `podcast` and print their `requested_url`s.
```python
from fundus import Crawler, PublisherCollection
from fundus.scraping.filter import regex_filter

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles=5, url_filter=regex_filter("advertisement|podcast")):
    print(article.html.requested_url)
```
Often it's useful to select for certain criteria rather than filtering them out.
To do so, use the `inverse` operator from `fundus.scraping.filter`.
Let's crawl a bunch of articles with URLs including the string `politic`.
```python
from fundus import Crawler, PublisherCollection
from fundus.scraping.filter import inverse, regex_filter

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles=5, url_filter=inverse(regex_filter("politic"))):
    print(article.html.requested_url)
```
Which should print something like this:
```
https://www.foxnews.com/politics/matt-gaetz-fisa-surveillance-citizens
https://www.thenation.com/article/politics/jim-jordan-chris-wray/
https://www.newyorker.com/podcast/political-scene/will-record-temperatures-finally-force-political-change
https://www.cnbc.com/2023/07/12/thai-elections-deep-generational-divides-belie-thailands-politics.html
https://www.reuters.com/business/autos-transportation/volkswagens-china-chief-welcomes-political-goal-germanys-beijing-strategy-2023-07-13/
```
NOTE: As with the `ExtractionFilter`, you can also write custom URL filters satisfying the `URLFilter` protocol.
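As a hedged sketch, a custom URL filter is just a callable that takes the URL string and returns `True` for URLs to drop, mirroring `regex_filter`'s drop-on-match behavior (the `drop_video_urls` helper below is a hypothetical example, not part of Fundus):

```python
# Hypothetical URL filter: True means the URL is skipped.
def drop_video_urls(url: str) -> bool:
    return "/video/" in url

assert drop_video_urls("https://example.com/video/clip-123") is True  # dropped
assert drop_video_urls("https://example.com/politics/story") is False  # kept
```

It could then be passed to the crawler as `crawler.crawl(url_filter=drop_video_urls)`.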
Sometimes it is useful to combine filters of the same kind.
You can do so by using the `lor` (logical or) and `land` (logical and) operators from `fundus.scraping.filter`.
Let's combine both URL filters from the examples above and add a new condition. Our goal is to get articles that include both strings `politic` and `trump` in their URL and don't include the strings `podcast` or `advertisement`.
```python
from fundus import Crawler, PublisherCollection
from fundus.scraping.filter import inverse, regex_filter, lor, land

crawler = Crawler(PublisherCollection.us)

# drop all URLs including the strings "advertisement" or "podcast"
filter1 = regex_filter("advertisement|podcast")

# drop all URLs not including both the strings "politic" and "trump"
filter2 = inverse(land(regex_filter("politic"), regex_filter("trump")))

for article in crawler.crawl(max_articles=10, url_filter=lor(filter1, filter2)):
    print(article.html.requested_url)
```
Which should print something like this:
```
https://www.foxnews.com/politics/desantis-meet-donors-new-yorks-southampton-next-week-pitch-campaigns-long-game-trump
https://occupydemocrats.com/2023/06/25/the-grift-trump-shifting-political-contributions-from-his-campaign-to-defense-fund/
https://www.foxnews.com/politics/trump-slams-doj-not-backing-him-e-jean-carroll-case-repeats-claim-doesnt-know-her
https://www.foxnews.com/politics/ron-desantis-wont-consider-being-trumps-running-mate-says-hes-not-number-2
https://www.foxnews.com/politics/georgia-swears-grand-jury-hear-trump-2020-election-case
https://www.foxnews.com/politics/chris-christie-tim-scott-reach-donor-requirement-participate-first-gop-debate-trump-yet-commit
https://www.newyorker.com/news/the-political-scene/will-trumps-crimes-matter-on-the-campaign-trail
https://occupydemocrats.com/2023/02/22/exploitation-trump-slammed-for-politicizing-toxic-train-derailment-with-ohio-visit/
https://www.thegatewaypundit.com/2023/06/pres-trump-defends-punching-down-politics-says-its/
https://www.thegatewaypundit.com/2023/06/breaking-poll-trump-most-popular-politician-country-rfk/
```
NOTE: You can use the `combine`, `lor`, and `land` operators on `ExtractionFilter`s as well.
Make sure to only use them on filters of the same kind.
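A rough sketch of how `lor` and `land` presumably compose same-kind filters, by OR-ing and AND-ing the boolean "drop" decisions of the wrapped callables (these helpers are illustrative assumptions, not Fundus' implementation):

```python
# Illustrative stand-ins for lor/land: combine the drop decisions
# of several filters of the same kind.
def lor_sketch(*filters):
    def combined(*args, **kwargs):
        return any(f(*args, **kwargs) for f in filters)
    return combined

def land_sketch(*filters):
    def combined(*args, **kwargs):
        return all(f(*args, **kwargs) for f in filters)
    return combined

def has_podcast(url: str) -> bool:
    return "podcast" in url

def has_politic(url: str) -> bool:
    return "politic" in url

# lor drops if either condition matches; land only if both do
assert lor_sketch(has_podcast, has_politic)("https://example.com/podcast/ep1") is True
assert land_sketch(has_podcast, has_politic)("https://example.com/podcast/ep1") is False
```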
Fundus supports different sources for articles, which are split into two categories:
- Only recent articles: `RSSFeed`, `NewsMap` (recommended for continuous crawling jobs)
- The whole site: `Sitemap` (recommended for one-time crawling)
NOTE: Sometimes the `Sitemap` provided by a specific publisher won't span the entire site.
You can preselect the source for your articles when initializing a new `Crawler`.
Let's initiate a crawler that only crawls from `NewsMap`s.
```python
from fundus import Crawler, PublisherCollection, NewsMap

crawler = Crawler(PublisherCollection.us, restrict_sources_to=[NewsMap])
```
NOTE: The `restrict_sources_to` parameter expects a list as value to specify multiple sources at once, e.g. `[RSSFeed, NewsMap]`.
The `crawl()` method supports functionality to filter out articles with URLs previously encountered in this run.
You can alter this behavior by setting the `only_unique` parameter.
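A hedged sketch of the deduplication `only_unique` presumably enables: track requested URLs in a set and skip repeats within one run (an assumption about the behavior, not Fundus' actual implementation):

```python
# Illustrative per-run URL deduplication: yield each URL only once.
def unique_urls(urls):
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url

assert list(unique_urls(["/a", "/b", "/a", "/c"])) == ["/a", "/b", "/c"]
```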
In the next section, we will show you how to search through publishers in the `PublisherCollection`.