## News Extractors

This notebook tests the NewsExtractor class, which retrieves news from a specific feed_url: the URL of a feed that lists news articles.

In [1]:
import sys

sys.path.append("..")

from ai_news_pipeline.extractors.news.news_extractors import NewsExtractor
from ai_news_pipeline.config import AINewsConfig

In [2]:
news_config = AINewsConfig()

#### Instantiate a NewsExtractor, which retrieves the articles/news from a specific feed URL

The NewsExtractor automatically chooses an ImageExtractor class based on the feed_url
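To make the selection step concrete, here is a minimal sketch of how an extractor could be looked up from the base URL of a feed. The registry, the function names, and the `/feed/` path used in the example are assumptions for illustration; only the extractor name and base URL come from the log output of this notebook, and the real `get_extractor` may work differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical registry mapping a site's base URL to an extractor name.
# The single entry mirrors the log line printed in this notebook.
EXTRACTORS = {
    "https://www.artificialintelligence-news.com": "AINEWSImageExtractor",
}


def base_url(feed_url: str) -> str:
    """Reduce a feed URL to scheme + host, e.g. https://example.com."""
    parts = urlparse(feed_url)
    return f"{parts.scheme}://{parts.netloc}"


def select_extractor(feed_url: str) -> Optional[str]:
    """Return the registered extractor name, or None when no entry exists."""
    return EXTRACTORS.get(base_url(feed_url))


# The "/feed/" path is an assumed example, not the configured feed URL.
print(select_extractor("https://www.artificialintelligence-news.com/feed/"))
print(select_extractor("https://www.xataka.com/feedburner.xml"))
```

Returning `None` for an unknown site lets the caller fall back to the data already present in the feed.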

In [3]:
ai_news_extractor = NewsExtractor(feed_url=news_config.AI_NEWS_FEED_URL)

2025-10-08 02:08:47.321 | INFO     | ai_news_pipeline.extractors.image_url.extractor_selector:get_extractor:73 - Extractor 'AINEWSImageExtractor' selected for base URL: https://www.artificialintelligence-news.com


## Get the raw parsed data from the Feed URL
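As a rough illustration of what "raw parsed data" means here, the sketch below turns a small RSS snippet into dictionaries with the same four keys shown in the outputs of this notebook. It uses only the standard library and assumes a plain RSS 2.0 layout (`item`, `title`, `link`, `pubDate`); `parse_feed` and `SAMPLE_RSS` are hypothetical names, and the real `_get_raw_parsed_data` may rely on a different parser.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Tiny RSS 2.0 sample built from the first article shown in this notebook.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>How AI is changing the way we travel</title>
    <link>https://www.artificialintelligence-news.com/news/how-ai-is-changing-the-way-we-travel/</link>
    <pubDate>Tue, 07 Oct 2025 11:00:00 +0000</pubDate>
  </item>
</channel></rss>"""


def parse_feed(xml_text):
    """Return one dict per <item>, using the keys seen in the notebook output."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "news_link": item.findtext("link", default=""),
            "image_link": "",  # filled in later by an image extractor, if any
            "publish_date": item.findtext("pubDate", default=""),
        })
    return articles


articles = parse_feed(SAMPLE_RSS)
# publish_date is an RFC 2822 style string; it can be converted when needed.
published = parsedate_to_datetime(articles[0]["publish_date"])
print(articles[0]["title"], published.isoformat())
```

Keeping `publish_date` as the original string matches the outputs in this notebook; converting it with `parsedate_to_datetime` is optional and only needed for sorting or comparing dates.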

In [4]:
ainews_articles = ai_news_extractor._get_raw_parsed_data()

In [None]:
n_aiarticles = len(ainews_articles)
print(f"{n_aiarticles} articles obtained...")
ainews_articles[:min(n_aiarticles, 5)]

12 articles obtained...


[{'title': 'How AI is changing the way we travel',
  'news_link': 'https://www.artificialintelligence-news.com/news/how-ai-is-changing-the-way-we-travel/',
  'image_link': 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/From-inspiration-to-itinerary_How-AI-is-rewriting-travel-1024x683.jpg',
  'publish_date': 'Tue, 07 Oct 2025 11:00:00 +0000'},
 {'title': '5 best AI observability tools in 2025',
  'news_link': 'https://www.artificialintelligence-news.com/news/5-best-ai-observability-tools-in-2025/',
  'image_link': 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/igor-omilaev-IsYT5rUuVcs-unsplash-1-1024x576.jpg',
  'publish_date': 'Mon, 06 Oct 2025 14:00:42 +0000'},
 {'title': 'Google’s new AI agent rewrites code to automate vulnerability fixes',
  'news_link': 'https://www.artificialintelligence-news.com/news/google-new-ai-agent-rewrites-code-automate-vulnerability-fixes/',
  'image_link': 'https://www.artificialintelligence-news.com/wp-co

If no ImageExtractor class is available for the feed_url, the extractor retrieves only the data available in the feed itself, and image_link is left empty.

This is because the image is extracted directly from the HTML of each article's webpage, and every site structures its HTML differently, with different components.
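The sketch below shows one way a site-specific image extractor might pull an image URL out of an article's HTML. It assumes the page exposes an Open Graph `og:image` meta tag, which is common but not guaranteed; `OgImageParser`, `extract_image_url`, and the sample HTML are hypothetical and are not the actual AINEWSImageExtractor logic.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collect the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = ""

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching meta tag.
        if tag != "meta" or self.image_url:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content") or ""


def extract_image_url(html):
    """Return the og:image URL, or an empty string when the tag is missing."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.image_url


# Made-up page used only to exercise the parser.
sample_html = """
<html><head>
  <meta property="og:title" content="Example article" />
  <meta property="og:image" content="https://example.com/cover.jpg" />
</head><body><p>Text</p></body></html>
"""
print(extract_image_url(sample_html))
print(repr(extract_image_url("<html><body>No meta tags</body></html>")))
```

A site that stores its lead image elsewhere (for example in a specific `<img>` or `<figure>` element) would need its own parser, which is why one extractor class per site is a reasonable design.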

In [6]:
# Example of feed url that does not have a specific ImageExtractor class
xataka_feed_url = "https://www.xataka.com/feedburner.xml"

xataka_news_extractor = NewsExtractor(feed_url=xataka_feed_url)



In [7]:
xataka_articles = xataka_news_extractor._get_raw_parsed_data()

In [None]:
n_aiarticles = len(xataka_articles)
print(f"{n_aiarticles} articles obtained...")
xataka_articles[:min(n_aiarticles, 5)]

26 articles obtained...


[{'title': 'Las redes sociales comenzaron a morir en 2022 y nadie se dio cuenta. La nueva pesadilla es que resurjan llenas de AI Slop',
  'news_link': 'https://www.xataka.com/servicios/redes-sociales-comenzaron-a-morir-2022-nadie-se-dio-cuenta-nueva-pesadilla-que-resurjan-llenas-ai-slop',
  'image_link': '',
  'publish_date': 'Wed, 08 Oct 2025 10:01:43 +0200'},
 {'title': 'A La 1 solo le quedaba por ganar la batalla de las mañanas. Lo ha conseguido como con todo lo demás: politizando sus contenidos',
  'news_link': 'https://www.xataka.com/cine-y-tv/a-1-solo-le-quedaba-ganar-batalla-mananas-ha-conseguido-como-todo-demas-politizando-sus-contenidos',
  'image_link': '',
  'publish_date': 'Wed, 08 Oct 2025 09:31:43 +0200'},
 {'title': 'Detrás del Nobel de Medicina de este año hay toda una lección de política científica para España y no parece que la vayamos a aprender',
  'news_link': 'https://www.xataka.com/investigacion/detras-nobel-medicina-este-ano-hay-toda-leccion-politica-cientifica-

## Filter the data based on keywords

These keywords can be set in `main`. Here the filter is applied to the Xataka articles retrieved above.
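A minimal sketch of the filtering idea, assuming plain substring matching on one article field with two keyword lists: one matched exactly and one matched regardless of case. The function below is a hypothetical stand-in for `_filter_by_keywords`, whose real implementation is not shown in this notebook, and the sample titles are made up.

```python
def filter_by_keywords(articles, case_sensitive_kw, case_insensitive_kw,
                       filter_key="title"):
    """Keep articles whose filter_key field contains at least one keyword."""
    lowered_kw = [kw.lower() for kw in case_insensitive_kw]
    kept = []
    for article in articles:
        text = article.get(filter_key, "")
        exact_hit = any(kw in text for kw in case_sensitive_kw)
        loose_hit = any(kw in text.lower() for kw in lowered_kw)
        if exact_hit or loose_hit:
            kept.append(article)
    return kept


# Hypothetical titles; the keyword lists are a small subset of the logged ones.
sample_articles = [
    {"title": "Social feeds are filling up with AI Slop"},
    {"title": "New chatgpt features announced"},
    {"title": "Spain airs a new morning show"},
]
matches = filter_by_keywords(
    sample_articles,
    case_sensitive_kw=[" AI ", "AI ", "A.I.", " AI-", "AI-"],
    case_insensitive_kw=["Machine Learning", "ChatGPT"],
)
print([article["title"] for article in matches])
```

Short acronyms such as "AI" are kept case-sensitive so that they do not match inside ordinary words like "airs". Plain substring matching can still match inside longer words, so short case-insensitive keywords may produce false positives; matching on word boundaries is one possible refinement.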

In [12]:
filtered_news = ai_news_extractor._filter_by_keywords(xataka_articles)

2025-10-08 02:10:47.305 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_by_keywords:118 - Filtering by the next parameters...
2025-10-08 02:10:47.306 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_by_keywords:119 - filter_key = 'title'
2025-10-08 02:10:47.306 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_by_keywords:120 - case_sen_search_kw =[' AI ', 'AI ', 'AI ', 'A.I.', ' AI-', 'AI-']
2025-10-08 02:10:47.307 | INFO     | ai_news_pipeline.extractors.news.news_extractors:_filter_by_keywords:121 - case_insen_search_kw =['Artificial Intelligence', 'Machine Learning', 'Deep Learning', 'Neural Networks', 'NLP', 'Computer Vision', 'Data Science', 'Gemini', 'Bard', 'ChatGPT', 'GPT-4', 'DALL-E', 'MidJourney', 'Stable Diffusi

In [14]:
n_aiarticles = len(filtered_news)
print(f"{n_aiarticles} articles obtained...")
filtered_news[:min(n_aiarticles, 5)]

2 articles obtained...


[{'title': 'Las redes sociales comenzaron a morir en 2022 y nadie se dio cuenta. La nueva pesadilla es que resurjan llenas de AI Slop',
  'news_link': 'https://www.xataka.com/servicios/redes-sociales-comenzaron-a-morir-2022-nadie-se-dio-cuenta-nueva-pesadilla-que-resurjan-llenas-ai-slop',
  'image_link': '',
  'publish_date': 'Wed, 08 Oct 2025 10:01:43 +0200'},
 {'title': 'AEMET acaba de bautizar a su primera DANA y se llama Alice: por qué este nombre lo cambia todo en las alertas meteorológicas',
  'news_link': 'https://www.xataka.com/ecologia-y-naturaleza/se-acabo-hablar-gota-fria-aemet-tiene-lista-nombres-para-danas-que-realmente-deben-preocuparnos',
  'image_link': '',
  'publish_date': 'Tue, 07 Oct 2025 21:01:43 +0200'}]