# AI-News Extraction

The goal is to extract news data from several sources. The following fields must be obtained for each news item:

- Title
- Publish date
- Link to news
- Link to image
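
As a minimal sketch, the four fields listed above could be held in a small record type. `NewsItem` and the example values are hypothetical and are not part of the notebook, which uses plain dictionaries instead.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class NewsItem:
    """One extracted news record (hypothetical container for illustration)."""
    title: str
    publish_date: str
    news_link: str
    image_link: str


# Placeholder values only; real values come from the feed and the article page
item = NewsItem(
    title="Example headline",
    publish_date="2025-10-01 13:44:30Z",
    news_link="https://example.com/news/example-headline/",
    image_link="https://example.com/images/example.png",
)

# asdict() yields the same dictionary shape the notebook builds by hand
record = asdict(item)
print(record)
```

A frozen dataclass makes a missing field fail at construction time instead of surfacing later as a `KeyError`.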

In [1]:
import requests
from bs4 import BeautifulSoup
import feedparser
from datetime import datetime, timezone

## News extraction from [AINEWS page](https://www.artificialintelligence-news.com/)

In [2]:
rss_url = "https://www.artificialintelligence-news.com/artificial-intelligence-news/feed/" # link to RSS feed
feed = feedparser.parse(rss_url)

# Extract relevant data from feed entries
ai_news = [
    {
        "title": entry.title,
        "news_link": entry.link,
        "date": entry.published,
        "summary": entry.summary
    }
    for entry in feed.entries
    ]

ai_news

[{'title': 'AI causes reduction in users’ brain activity – MIT',
  'news_link': 'https://www.artificialintelligence-news.com/news/ai-causes-reduction-in-users-brain-activity-mit/',
  'date': 'Wed, 01 Oct 2025 13:44:30 +0000',
  'summary': '<p>A study from MIT (Massachusetts Institute of Technology) has found that the human brain not only works less hard when using an LLM, but its effects continue, negatively affecting mental activity in future work. The researchers used a limited number of subjects for their experiments (a limitation stated in the paper [PDF]), who were asked [&#8230;]</p>\n<p>The post <a href="https://www.artificialintelligence-news.com/news/ai-causes-reduction-in-users-brain-activity-mit/">AI causes reduction in users’ brain activity – MIT</a> appeared first on <a href="https://www.artificialintelligence-news.com">AI News</a>.</p>'},
 {'title': 'The 5 best AI AppSec tools in 2025',
  'news_link': 'https://www.artificialintelligence-news.com/news/the-5-best-ai-appsec-

An image for each news item is also required. Since the feed entries parsed above do not include one, the image is extracted directly from the article page.

In [5]:
url = "https://www.artificialintelligence-news.com/news/the-5-best-ai-appsec-tools-in-2025/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Error fetching page: {e}")
    raise  # without a response there is nothing to parse below

soup = BeautifulSoup(response.content, 'html.parser')

# Keep only the Elementor widget containers that hold an <img> tag
all_containers = soup.select('.elementor-widget-container')
containers_with_images = [c for c in all_containers if c.find('img')]

# The main article image is the one declared with width="800"
image = [container.find('img').get("src") for container in containers_with_images if container.find('img').get('width') == '800'][0]

image


'https://www.artificialintelligence-news.com/wp-content/uploads/2025/09/Untitled-design-73-1024x573.png'
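
The selector above is tied to the site's Elementor markup and to the `width="800"` attribute, so it may break if the page layout changes. A possible fallback, sketched here under the assumption that the article page exposes an Open Graph `og:image` meta tag (not verified for this site), reads the image URL from the page head instead. `OgImageParser`, `find_og_image` and the sample HTML are hypothetical. The standard-library `html.parser` is used only to keep the sketch self-contained; the same lookup could be written with BeautifulSoup.

```python
from html.parser import HTMLParser
from typing import Optional


class OgImageParser(HTMLParser):
    """Stores the content of the first <meta property="og:image"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags are relevant; stop looking after the first match
        if tag != "meta" or self.og_image is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.og_image = attributes.get("content")


def find_og_image(html: str) -> Optional[str]:
    """Return the og:image URL found in the HTML text, or None if absent."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Hypothetical page content used only to exercise the function
sample_html = """
<html><head>
<meta property="og:title" content="Example headline" />
<meta property="og:image" content="https://example.com/images/cover.png" />
</head><body><p>Article text</p></body></html>
"""

print(find_og_image(sample_html))
```

With a real page, `response.text` from the request above would be passed in place of `sample_html`.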

## Defining functions to extract all the AI-News information

The feed returns dates in RFC 2822 format (e.g., 'Tue, 30 Sep 2025 11:07:47 +0000'), so they must be converted to the format '%Y-%m-%d %H:%M:%SZ'.

In [3]:
def format_date(date_str: str) -> str:
    """
    Convert an RFC 2822 date string to UTC in ISO 8601 style with a 'Z' suffix.

    Args:
        date_str (str): Date string in RFC 2822 format (e.g., 'Fri, 09 Oct 2020 14:19:00 +0000').
    
    Returns:
        str: UTC date string in the format '%Y-%m-%d %H:%M:%SZ' (e.g., '2020-10-09 14:19:00Z').
    """
    if not isinstance(date_str, str):
        raise ValueError("Input must be a string in the format '%a, %d %b %Y %H:%M:%S %z' (e.g.: 'Fri, 09 Oct 2020 14:19:00 +0000').")
    
    # Parse the RFC 2822 date
    dt = datetime.strptime(date_str, r'%a, %d %b %Y %H:%M:%S %z')
    # Normalize to UTC so the 'Z' suffix is correct for any input offset
    return dt.astimezone(timezone.utc).strftime(r'%Y-%m-%d %H:%M:%SZ')
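
The same conversion can also be sketched with the standard library's `email.utils.parsedate_to_datetime`, which parses RFC 2822 dates without a hand-written format string. `rfc2822_to_utc` is a hypothetical name, and treating a date without a usable offset as UTC is an assumption of this sketch, not something the feed guarantees.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def rfc2822_to_utc(date_str: str) -> str:
    """Convert an RFC 2822 date string to 'YYYY-MM-DD HH:MM:SSZ' in UTC."""
    dt = parsedate_to_datetime(date_str)
    if dt.tzinfo is None:
        # Assumption: a date without a usable offset is already in UTC
        dt = dt.replace(tzinfo=timezone.utc)
    # Shift to UTC so the trailing 'Z' is accurate for any input offset
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")


print(rfc2822_to_utc("Tue, 30 Sep 2025 11:07:47 +0000"))  # 2025-09-30 11:07:47Z
print(rfc2822_to_utc("Tue, 30 Sep 2025 13:07:47 +0200"))  # 2025-09-30 11:07:47Z
```

The second call shows why the shift to UTC matters: a non-zero offset would otherwise be labeled 'Z' while still carrying local time.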

In [13]:
def extract_news_image(news_url: str) -> str | None:
    """
    Extract the main image URL from a news article page.

    Args:
        news_url (str): URL of the news article.
    
    Returns:
        str | None: URL of the main image in the article, or None if the page
        could not be fetched or no matching image was found.
    """
    if not isinstance(news_url, str):
        raise ValueError("Input must be a string representing the news article URL.")
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    }

    try:
        response = requests.get(news_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching page: {e}")
        return None  # no response to parse

    soup = BeautifulSoup(response.content, 'html.parser')

    # The main article image is the first <img> with width="800" inside an Elementor widget container
    for container in soup.select('.elementor-widget-container'):
        img = container.find('img')
        if img and img.get('width') == '800':
            return img.get("src")
    return None
    

In [None]:
def retrieve_ai_news(url: str) -> list[dict]:
    """
    Retrieve AI news from the given RSS feed URL.

    Args:
        url (str): URL of the RSS feed.

    Returns:
        list[dict]: List of dictionaries containing news details.    
    """
    if not isinstance(url, str):
        raise ValueError("Input must be a string representing the RSS feed URL.")
    feed = feedparser.parse(url)  # use the function argument, not the global rss_url

    # Extract relevant data from feed entries
    ai_news = [
        {
            "title": entry.title,
            "news_link": entry.link,
            "publish_date": format_date(entry.published),
            "image_link": extract_news_image(entry.link),
            "summary": entry.summary
        }
        for entry in feed.entries
        ]

    return ai_news

In [14]:
retrieve_ai_news(rss_url)

[{'title': 'AI causes reduction in users’ brain activity – MIT',
  'news_link': 'https://www.artificialintelligence-news.com/news/ai-causes-reduction-in-users-brain-activity-mit/',
  'publish_date': '2025-10-01 13:44:30Z',
  'image_link': 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/ai-cognitive-hero-1024x683.jpg',
  'summary': '<p>A study from MIT (Massachusetts Institute of Technology) has found that the human brain not only works less hard when using an LLM, but its effects continue, negatively affecting mental activity in future work. The researchers used a limited number of subjects for their experiments (a limitation stated in the paper [PDF]), who were asked [&#8230;]</p>\n<p>The post <a href="https://www.artificialintelligence-news.com/news/ai-causes-reduction-in-users-brain-activity-mit/">AI causes reduction in users’ brain activity – MIT</a> appeared first on <a href="https://www.artificialintelligence-news.com">AI News</a>.</p>'},
 {'title': 'The 5