## News Extractors

This notebook tests the NewsExtractor class, which retrieves news from a specific feed_url: the URL of a feed that lists news articles.

In [1]:
import sys
import os
from dotenv import load_dotenv

sys.path.append("..")

from ai_news_pipeline.extractors.news.news_extractors import NewsExtractor
from ai_news_pipeline.config import AINewsConfig

load_dotenv()

True

In [2]:
news_config = AINewsConfig()

## Instantiate a NewsExtractor class

Once instantiated, a NewsExtractor can handle multiple feed_urls

- The NewsExtractor automatically chooses an ImageExtractor class based on the feed_url
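The selection step can be pictured as a lookup keyed on the base URL of the feed, which is consistent with the `get_extractor` log message shown for the MIT feed. The following is only a minimal sketch and not the pipeline's actual code: `EXTRACTORS_BY_BASE_URL`, `NoImageExtractor`, `base_url`, `select_extractor`, and the example MIT feed path are all assumptions made for illustration.

```python
from urllib.parse import urlparse


class MITImageExtractor:
    """Placeholder standing in for a site-specific image extractor."""


class NoImageExtractor:
    """Hypothetical fallback used when no site-specific extractor is registered."""


# Hypothetical registry: base URL -> extractor class.
EXTRACTORS_BY_BASE_URL = {
    "https://news.mit.edu": MITImageExtractor,
}


def base_url(feed_url: str) -> str:
    """Reduce a feed URL to its scheme and host, e.g. https://news.mit.edu."""
    parts = urlparse(feed_url)
    return f"{parts.scheme}://{parts.netloc}"


def select_extractor(feed_url: str):
    """Return the extractor class registered for the feed's base URL."""
    return EXTRACTORS_BY_BASE_URL.get(base_url(feed_url), NoImageExtractor)


# The MIT feed path below is illustrative; only the base URL appears in the logs.
print(select_extractor("https://news.mit.edu/rss/feed").__name__)
print(select_extractor("https://www.xataka.com/feedburner.xml").__name__)
```

Keying the registry on the base URL rather than the full feed URL means several feeds from the same site can share one extractor.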

In [None]:
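# No feed_url is passed, so the class defaults apply.
# The cells below refer to this instance as ai_news_extractor.
ai_news_extractor = NewsExtractor()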

In [4]:
mit_news_extractor = NewsExtractor(feed_url=news_config.MIT_NEWS_FEED_URL)

2025-10-08 12:04:30.645 | INFO     | ai_news_pipeline.extractors.image_url.extractor_selector:get_extractor:73 - Extractor 'MITImageExtractor' selected for base URL: https://news.mit.edu


## Get the raw parsed data from the Feed URL

In [5]:
ai_news_extractor._get_articles()

2025-10-08 12:04:46.903 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_get_articles:88 - 12 articles extracted


In [6]:
n_aiarticles = len(ai_news_extractor.current_data)
print(f"{n_aiarticles} articles obtained...")
ai_news_extractor.current_data.head(min(n_aiarticles, 3))

12 articles obtained...

| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | Samsungs tiny AI model beats giant reasoning LLMs | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-08 11:55:19+00:00 |
| 1 | Tuned Global strengthens its leadership in mus... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-08 10:37:29+00:00 |
| 2 | AI Redaction That Puts Privacy First: CaseGuar... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-08 09:07:44+00:00 |


If no image extractor class is available for the feed_url, the NewsExtractor retrieves only the data available in the feed itself and leaves image_link empty.

This is because the image is extracted directly from the HTML of each article's webpage, and every site structures its HTML differently, with different components.
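To make the site-specific nature of image extraction concrete, here is a minimal sketch of one possible extraction rule: reading the `og:image` meta tag with Python's standard `html.parser`. It is not the pipeline's implementation; `OpenGraphImageParser`, `extract_image_url`, and the two sample pages are hypothetical. A page that exposes its image differently (the second sample) yields nothing, which is why a dedicated extractor per site is needed.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Collect the first <meta property="og:image"> URL found in a page."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <meta> tag.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def extract_image_url(html: str):
    """Return the og:image URL of a page, or None when the tag is absent."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.image_url


# Hypothetical pages: one uses og:image, the other only a plain <img> tag.
page_with_tag = (
    '<html><head><meta property="og:image" '
    'content="https://example.com/a.jpg"></head></html>'
)
page_without_tag = (
    '<html><body><img class="hero" src="https://example.com/b.jpg"></body></html>'
)

print(extract_image_url(page_with_tag))
print(extract_image_url(page_without_tag))
```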

In [7]:
# Example of feed url that does not have a specific ImageExtractor class
xataka_feed_url = "https://www.xataka.com/feedburner.xml"

xataka_news_extractor = NewsExtractor(feed_url=xataka_feed_url)



In [8]:
xataka_news_extractor._get_articles()

2025-10-08 12:04:47.403 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_get_articles:88 - 26 articles extracted


In [9]:
n_aiarticles = len(xataka_news_extractor.current_data)
print(f"{n_aiarticles} articles obtained...")
xataka_news_extractor.current_data.head(min(n_aiarticles, 3))

26 articles obtained...


| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | "Son cosas por las que un universitario se met... | https://www.xataka.com/robotica-e-ia/cosas-que... | | 2025-10-08 20:01:43+02:00 |
| 1 | La justicia da la razón a los empleados de Hol... | https://www.xataka.com/empresas-y-economia/jus... | | 2025-10-08 19:40:24+02:00 |
| 2 | Hace cinco años trabajaba desde su baño al bor... | https://www.xataka.com/empresas-y-economia/hac... | | 2025-10-08 19:31:43+02:00 |


In [10]:
xataka_news_extractor.current_data.image_link.isnull().head()

0    True
1    True
2    True
3    True
4    True
Name: image_link, dtype: bool

If the feed_url does not point to a valid RSS feed, no articles can be retrieved and an error is logged
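A rough way to tell a feed from an ordinary web page is to check whether the document parses as XML with an RSS or Atom root element. The sketch below uses only the standard library and works on text that has already been downloaded. `looks_like_feed` and the sample documents are hypothetical, and this is not necessarily how NewsExtractor detects the problem.

```python
import xml.etree.ElementTree as ET


def looks_like_feed(document: str) -> bool:
    """Return True when the text parses as XML with an RSS or Atom root element."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Typical HTML pages are not well-formed XML and end up here.
        return False
    # Namespaced tags look like "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in {"rss", "feed", "RDF"}


# Hypothetical sample documents.
rss_sample = '<rss version="2.0"><channel><title>News</title></channel></rss>'
atom_sample = '<feed xmlns="http://www.w3.org/2005/Atom"><title>News</title></feed>'
html_sample = (
    '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
    "<body><p>AI news</p></body></html>"
)

for name, text in [("rss", rss_sample), ("atom", atom_sample), ("html", html_sample)]:
    print(name, looks_like_feed(text))
```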

In [11]:
wrong_feed_url = "https://techcrunch.com/category/artificial-intelligence/"
bad_news_extractor = NewsExtractor(feed_url=wrong_feed_url)



In [12]:
bad_news_extractor._get_articles()

2025-10-08 12:04:48.536 | ERROR    | ai_news_pipeline.extractors.news.news_extractors:_get_articles:96 - No articles extracted from https://techcrunch.com/category/artificial-intelligence/  Make sure the url is RSS-compatible


## Filter the data based on keywords

In [13]:
ai_news_extractor._filter_title_by_keywords()

2025-10-08 12:04:48.541 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:134 - Filtering articles by the next parameters...
2025-10-08 12:04:48.541 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:135 - case_sen_search_kw =[' AI ', 'AI ', 'AI ', 'A.I.', ' AI-', 'AI-']
2025-10-08 12:04:48.541 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:136 - case_insen_search_kw =['Artificial Intelligence', 'Machine Learning', 'Deep Learning', 'Neural Networks', 'NLP', 'Computer Vision', 'Data Science', 'Gemini', 'Bard', 'ChatGPT', 'GPT-4', 'DALL-E', 'MidJourney', 'Stable Diffusion', 'Claude', 'LLaMA', 'Whisper']
2025-10-08 12:04:48.541 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_f

In [14]:
n_aiarticles = len(ai_news_extractor.current_data)
print(f"{n_aiarticles} articles after filtering...")
ai_news_extractor.current_data.head(min(n_aiarticles, 3))

10 articles after filtering...

| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | Samsungs tiny AI model beats giant reasoning LLMs | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-08 11:55:19+00:00 |
| 1 | AI Redaction That Puts Privacy First: CaseGuar... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-08 09:07:44+00:00 |
| 2 | How AI is changing the way we travel | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-07 11:00:00+00:00 |
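The two keyword lists in the log suggest why the filter distinguishes case: a short token such as "AI" has to be matched case-sensitively, otherwise words like "Rain" would match, while longer phrases can safely ignore case. The sketch below illustrates that idea on plain strings. `title_matches` is hypothetical, the keyword lists are shortened from the log, the sample titles other than the first are made up, and combining the two lists with a logical OR is an assumption about how the filter behaves.

```python
# Shortened from the lists shown in the log output.
CASE_SENSITIVE_KEYWORDS = [" AI ", "AI ", "A.I.", " AI-", "AI-"]
CASE_INSENSITIVE_KEYWORDS = ["Artificial Intelligence", "Machine Learning", "ChatGPT"]


def title_matches(title: str) -> bool:
    """Return True if the title contains any keyword from either list."""
    # Exact-case match for short tokens such as "AI".
    if any(keyword in title for keyword in CASE_SENSITIVE_KEYWORDS):
        return True
    # Case-insensitive match for longer, unambiguous phrases.
    lowered = title.lower()
    return any(keyword.lower() in lowered for keyword in CASE_INSENSITIVE_KEYWORDS)


titles = [
    "How AI is changing the way we travel",
    "New machine learning benchmark released",
    "Rain expected in the afternoon",
    "Quarterly earnings beat expectations",
]
kept = [title for title in titles if title_matches(title)]
print(kept)
```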


In [15]:
bad_news_extractor._filter_title_by_keywords()

2025-10-08 12:04:48.573 | ERROR    | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:128 - No data found. Please run '_get_articles()' before calling this method.


## Filter the data based on the publish date
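Filtering by age amounts to comparing each article's publish date against a cutoff computed from the current time. The sketch below shows the idea with timezone-aware datetimes, which matter here because the feeds report different UTC offsets. `filter_by_age`, its default of 7 days, the `now` parameter, and the sample articles are assumptions for illustration and do not reflect the signature of `_filter_by_age`.

```python
from datetime import datetime, timedelta, timezone


def filter_by_age(articles, max_days_old=7, now=None):
    """Keep articles whose publish_date is at most max_days_old days before now."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_days_old)
    # Aware datetimes compare correctly even when their UTC offsets differ.
    return [article for article in articles if article["publish_date"] >= cutoff]


# Hypothetical articles with different UTC offsets.
articles = [
    {
        "title": "recent",
        "publish_date": datetime(2025, 10, 8, 11, 55, 19, tzinfo=timezone.utc),
    },
    {
        "title": "old",
        "publish_date": datetime(
            2025, 9, 20, 9, 0, tzinfo=timezone(timedelta(hours=2))
        ),
    },
]

# A fixed "now" keeps the example reproducible.
fixed_now = datetime(2025, 10, 8, 12, 0, tzinfo=timezone.utc)
recent = filter_by_age(articles, max_days_old=7, now=fixed_now)
print([article["title"] for article in recent])
```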

In [16]:
ai_news_extractor._filter_by_age(max_days_old=7)

2025-10-08 12:04:48.580 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_by_age:181 - Filtering articles published within the last 7 days.
2025-10-08 12:04:48.582 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_by_age:197 - Date filtering complete. 5 articles published within the allowed range.


In [17]:
bad_news_extractor._filter_by_age()

2025-10-08 12:04:48.588 | ERROR    | ai_news_pipeline.extractors.news.news_extractors:_filter_by_age:175 - No data found. Please run '_get_articles()' before calling this method.


## Store the data in an Excel file

In [18]:
ai_news_extractor._store_to_excel(local_file_path=os.getenv("AI_NEWS_FILE_PATH"))

2025-10-08 12:04:48.617 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_store_to_excel:242 - Articles successfully stored


## Do all the previous steps with one method

In [19]:
ai_news_extractor.extract(local_file_path=os.getenv("AI_NEWS_FILE_PATH"))

2025-10-08 12:04:48.627 | INFO     | ai_news_pipeline.extractors.news.news_extractors:extract:274 - Starting articles extraction...
2025-10-08 12:05:04.531 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_get_articles:88 - 12 articles extracted
2025-10-08 12:05:04.533 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:134 - Filtering articles by the next parameters...
2025-10-08 12:05:04.533 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:135 - case_sen_search_kw =[' AI ', 'AI ', 'AI ', 'A.I.', ' AI-', 'AI-']
2025-10-08 12:05:04.533 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:136 - case_insen_search_kw

In [None]:
ai_news_extractor

You can also skip storing the articles in an Excel file by passing store_data=False
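One way to picture a single entry point that fetches, filters, and optionally stores is a function that runs the steps in order and guards the storage step with a flag. This is a simplified sketch under assumptions, not the actual `extract` method: `run_extraction` and its `fetch`, `filters`, and `store` parameters are hypothetical, and plain lists stand in for the article data.

```python
def run_extraction(fetch, filters, store=None, store_data=True):
    """Fetch articles, apply each filter in order, and optionally store the result."""
    articles = fetch()
    for apply_filter in filters:
        articles = apply_filter(articles)
    # Storage is skipped when store_data is False or no store callable is given.
    if store_data and store is not None:
        store(articles)
    return articles


# Hypothetical stand-ins for the real steps.
def fetch_titles():
    return ["AI model released", "Weather update", "AI chip shortage"]


def keep_ai_titles(titles):
    return [title for title in titles if "AI " in title]


stored_batches = []
result = run_extraction(
    fetch_titles,
    [keep_ai_titles],
    store=stored_batches.append,
    store_data=False,
)
print(result)
print(stored_batches)
```

Passing the steps in as callables keeps the sketch independent of any particular storage backend.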

In [20]:
mit_news_extractor.extract(store_data=False)

2025-10-08 12:05:04.563 | INFO     | ai_news_pipeline.extractors.news.news_extractors:extract:274 - Starting articles extraction...
2025-10-08 12:05:33.088 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_get_articles:88 - 50 articles extracted
2025-10-08 12:05:33.103 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:134 - Filtering articles by the next parameters...
2025-10-08 12:05:33.106 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:135 - case_sen_search_kw =[' AI ', 'AI ', 'AI ', 'A.I.', ' AI-', 'AI-']
2025-10-08 12:05:33.107 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_title_by_keywords:136 - case_insen_search_kw