## News Extractors

This notebook tests the NewsExtractor class, which retrieves news from a specific feed_url, that is, the URL of a feed that lists news articles.

In [1]:
import sys

sys.path.append("..")

from news_extraction_pipeline.extractors.news.news_extractors import NewsExtractor
from news_extraction_pipeline.config import AINewsConfig

In [2]:
news_config = AINewsConfig()

## Instantiate a NewsExtractor class

A single NewsExtractor instance can handle multiple feed_urls, so it only needs to be instantiated once.

- The NewsExtractor automatically chooses an ImageExtractor class based on the feed_url

In [3]:
extractor = NewsExtractor()

There are two ways to set the feed_url: call the set_current_feed_url() method, or assign to the current_feed_url attribute.

In [4]:
extractor.set_current_feed_url(news_config.MIT_NEWS_FEED_URL)

2025-10-08 17:18:36.099 | INFO     | ai_news_pipeline.extractors.news.news_extractors:current_feed_url:82 - Setting current feed url to https://news.mit.edu/rss/feed
2025-10-08 17:18:36.099 | INFO     | ai_news_pipeline.extractors.image_url.extractor_selector:get_extractor:73 - Extractor 'MITImageExtractor' selected for base URL: https://news.mit.edu


In [5]:
extractor.current_feed_url = news_config.AI_NEWS_FEED_URL

2025-10-08 17:18:36.106 | INFO     | ai_news_pipeline.extractors.news.news_extractors:current_feed_url:82 - Setting current feed url to https://www.artificialintelligence-news.com/artificial-intelligence-news/feed/
2025-10-08 17:18:36.108 | INFO     | ai_news_pipeline.extractors.image_url.extractor_selector:get_extractor:73 - Extractor 'AINEWSImageExtractor' selected for base URL: https://www.artificialintelligence-news.com
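Both log entries above name `current_feed_url` as the emitting function, which suggests that the method form simply delegates to a property setter. The following is a minimal sketch of that pattern, not the actual NewsExtractor code; the class name `FeedHolder` and its internals are assumptions made for illustration.

```python
class FeedHolder:
    """Hypothetical stand-in for NewsExtractor; only the URL handling is sketched."""

    def __init__(self):
        self._current_feed_url = None

    @property
    def current_feed_url(self):
        return self._current_feed_url

    @current_feed_url.setter
    def current_feed_url(self, feed_url):
        # Side effects such as logging or extractor selection would live here,
        # so that both entry points behave identically.
        self._current_feed_url = feed_url

    def set_current_feed_url(self, feed_url):
        # Method form: delegates to the property setter above.
        self.current_feed_url = feed_url


holder = FeedHolder()
holder.set_current_feed_url("https://news.mit.edu/rss/feed")
holder.current_feed_url = "https://example.com/feed"  # equivalent attribute form
```

Keeping the logic in the setter means the two ways of setting the URL cannot drift apart.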


If a feed URL has no matching ImageExtractor, a warning is raised.

In [6]:
bad_feed_url = "https://techcrunch.com/category/artificial-intelligence/"

In [7]:
extractor.set_current_feed_url(bad_feed_url)

2025-10-08 17:18:36.122 | INFO     | ai_news_pipeline.extractors.news.news_extractors:current_feed_url:82 - Setting current feed url to https://techcrunch.com/category/artificial-intelligence/
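To illustrate how an image extractor might be picked from the feed URL, here is a rough sketch that reduces the URL to its base (scheme plus host) and looks it up in a registry. The registry, the helper names, and the use of the standard `warnings` module are all assumptions; the real pipeline may well report the missing extractor through its logger instead. The two class names are taken from the log output above and are only placeholders here.

```python
import warnings
from urllib.parse import urlparse


class MITImageExtractor:
    """Placeholder; the real class lives in the pipeline package."""


class AINEWSImageExtractor:
    """Placeholder; the real class lives in the pipeline package."""


# Hypothetical registry mapping a base URL to an extractor class.
EXTRACTORS_BY_BASE_URL = {
    "https://news.mit.edu": MITImageExtractor,
    "https://www.artificialintelligence-news.com": AINEWSImageExtractor,
}


def base_url_of(feed_url):
    """Reduce a feed URL to scheme plus host, e.g. https://news.mit.edu."""
    parts = urlparse(feed_url)
    return f"{parts.scheme}://{parts.netloc}"


def select_image_extractor(feed_url):
    """Return the class registered for the feed's base URL, or None with a warning."""
    base_url = base_url_of(feed_url)
    extractor_cls = EXTRACTORS_BY_BASE_URL.get(base_url)
    if extractor_cls is None:
        warnings.warn(f"No image extractor registered for {base_url}")
    return extractor_cls
```

With this sketch, `select_image_extractor("https://news.mit.edu/rss/feed")` returns `MITImageExtractor`, while the TechCrunch URL used above yields `None` and a warning.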


## Get the articles from the current feed url

In [10]:
extractor.current_feed_url = news_config.MIT_NEWS_FEED_URL

mit_articles = extractor.get_articles()

mit_articles.head(2)

2025-10-08 17:19:25.072 | INFO     | ai_news_pipeline.extractors.news.news_extractors:current_feed_url:82 - Setting current feed url to https://news.mit.edu/rss/feed
2025-10-08 17:19:25.073 | INFO     | ai_news_pipeline.extractors.image_url.extractor_selector:get_extractor:73 - Extractor 'MITImageExtractor' selected for base URL: https://news.mit.edu
2025-10-08 17:19:59.104 | INFO     | ai_news_pipeline.extractors.news.news_extractors:get_articles:151 - 50 articles extracted


| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | Immune-informed brain aging research offers ne... | https://news.mit.edu/2025/immune-informed-brai... | https://news.mit.edu/sites/default/files/style... | 2025-10-08 15:30:00-04:00 |
| 1 | MIT Schwarzman College of Computing and MBZUAI... | https://news.mit.edu/2025/mit-schwarzman-colle... | https://news.mit.edu/sites/default/files/style... | 2025-10-08 15:10:00-04:00 |


If .get_articles() is called again while the feed URL is unchanged, it returns the previously extracted articles instead of fetching them again.

In [12]:
extractor.get_articles().head(2)

2025-10-08 17:21:39.104 | INFO     | ai_news_pipeline.extractors.news.news_extractors:get_articles:125 - Articles from feed url https://news.mit.edu/rss/feed already extracted


| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | Immune-informed brain aging research offers ne... | https://news.mit.edu/2025/immune-informed-brai... | https://news.mit.edu/sites/default/files/style... | 2025-10-08 15:30:00-04:00 |
| 1 | MIT Schwarzman College of Computing and MBZUAI... | https://news.mit.edu/2025/mit-schwarzman-colle... | https://news.mit.edu/sites/default/files/style... | 2025-10-08 15:10:00-04:00 |
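The reuse of earlier results can be pictured as a small cache keyed on the feed URL. The sketch below is an assumption about the general pattern, not the NewsExtractor implementation; `CachingFetcher` and the `fetch` callable are hypothetical names.

```python
class CachingFetcher:
    """Hypothetical sketch: remembers the articles of the last feed URL fetched."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a feed URL and returning articles
        self._cached_url = None
        self._cached_articles = None

    def get_articles(self, feed_url):
        # Same URL as last time and something cached: skip the fetch.
        if feed_url == self._cached_url and self._cached_articles is not None:
            return self._cached_articles
        articles = self._fetch(feed_url)
        self._cached_url = feed_url
        self._cached_articles = articles
        return articles


calls = []


def fake_fetch(feed_url):
    # Stand-in for the slow network request; records each real fetch.
    calls.append(feed_url)
    return [f"article from {feed_url}"]


fetcher = CachingFetcher(fake_fetch)
first = fetcher.get_articles("https://news.mit.edu/rss/feed")
second = fetcher.get_articles("https://news.mit.edu/rss/feed")  # served from cache
```

In the run above, the first extraction took about 34 seconds according to the log timestamps, so skipping the repeat fetch is a noticeable saving.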


If feed_url is set to an ordinary web page URL rather than an RSS feed, .get_articles() logs an error.

In [13]:
extractor.set_current_feed_url(bad_feed_url)

news = extractor.get_articles()

news

2025-10-08 17:23:01.812 | INFO     | ai_news_pipeline.extractors.news.news_extractors:current_feed_url:82 - Setting current feed url to https://techcrunch.com/category/artificial-intelligence/
2025-10-08 17:23:02.216 | ERROR    | ai_news_pipeline.extractors.news.news_extractors:get_articles:160 - No articles extracted from https://techcrunch.com/category/artificial-intelligence/  Make sure the url is RSS-compatible
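One way to catch this case early is to check whether a downloaded document looks like a feed before trying to extract articles from it. The helper below is only an illustrative sketch using the standard library; `looks_like_feed` is a hypothetical name and is not part of NewsExtractor, which may detect the problem differently.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(document):
    """Return True if the text parses as XML with an RSS or Atom root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Typical HTML pages are not well-formed XML and end up here.
        return False
    # Drop any XML namespace prefix, e.g. '{http://www.w3.org/2005/Atom}feed'.
    tag = root.tag.rsplit("}", 1)[-1]
    return tag in {"rss", "feed"}


rss_text = '<rss version="2.0"><channel><title>Example</title></channel></rss>'
html_text = "<html><body><p>Just a page<br></p></body></html>"

is_rss = looks_like_feed(rss_text)
is_html_feed = looks_like_feed(html_text)
```

Here `is_rss` is `True` and `is_html_feed` is `False`, since the HTML snippet is not well-formed XML.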


Once you have the DataFrame, you can modify it however you like.
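As an illustration of such post-processing, the sketch below parses the publish dates, drops rows without an image, and sorts newest first. The column names match the output shown earlier, but the rows are made-up placeholders standing in for the result of `get_articles()`.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by get_articles();
# the rows are invented for the example.
articles = pd.DataFrame(
    {
        "title": ["Example article A", "Example article B", "Example article C"],
        "news_link": [
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/c",
        ],
        "image_link": ["https://example.com/a.jpg", None, "https://example.com/c.jpg"],
        "publish_date": [
            "2025-10-08 15:30:00-04:00",
            "2025-10-07 09:00:00-04:00",
            "2025-10-08 15:10:00-04:00",
        ],
    }
)

# Parse the dates so they compare chronologically rather than as strings.
articles["publish_date"] = pd.to_datetime(articles["publish_date"])

# Keep only articles that have an image, newest first.
with_images = (
    articles.dropna(subset=["image_link"])
    .sort_values("publish_date", ascending=False)
    .reset_index(drop=True)
)
```

If `get_articles()` already returns parsed datetimes, the `pd.to_datetime` step is harmless and can be dropped.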