# AI-News Extraction

This pipeline is part of the AI-Hub SharePoint Automation project, which aims to automate the most repetitive tasks, such as manually adding new AI events and AI news.

This pipeline was previously built in PowerAutomate using an RSS connector, which retrieves the following information from RSS links:

- Title
- Publish date
- Link to news
- Link to image

An external platform was used to generate these RSS links from regular website URLs. However, it must be replaced to avoid extra charges.

This notebook replaces the external platform, allowing website URLs to be used directly to extract the required information. The information obtained is stored (temporarily) in OneDrive through a local folder that is synced to OneDrive. At the time of writing, approval is pending to use Microsoft Graph and store the data directly in a SharePoint list.

## Full AI-News Extraction Pipeline -> Currently active in PowerAutomate

1. Extract AI news from different sources ([MIT](https://web.mit.edu/) and [AINEWS](https://www.artificialintelligence-news.com/)) using an RSS connector

2. Clean the data obtained from the news:

    - Format dates and titles so that SharePoint can read them correctly
    - Filter the news by keywords in their titles (e.g. AI, ChatGPT, Gemini, LLMs)
    - Filter the news by date, keeping only items published in the last 2 days

3. Check whether the news items obtained are already stored in a SharePoint list. If so, discard them; otherwise, store them in the SharePoint list
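Step 3 runs in PowerAutomate against a SharePoint list, so no code for it appears in this notebook. Purely as an illustration, the same check could be sketched in Python, assuming each news item is identified by its link. `discard_known_news` and `stored_links` are hypothetical names, not part of the project.

```python
def discard_known_news(news: list[dict], stored_links: set[str]) -> list[dict]:
    """Return only the items whose link is not already stored (hypothetical helper)."""
    return [item for item in news if item["news_link"] not in stored_links]


# Hypothetical data standing in for the SharePoint list and a fresh extraction
stored_links = {"https://example.com/news/already-stored/"}
fetched = [
    {"title": "Already stored item", "news_link": "https://example.com/news/already-stored/"},
    {"title": "Brand new item", "news_link": "https://example.com/news/brand-new/"},
]

new_items = discard_known_news(fetched, stored_links)
print([item["title"] for item in new_items])
```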

## Proposed New (Temporary) Pipeline

1. Using PowerAutomate, send an email asking the AI-Hub SharePoint owner to run this notebook (later, a Python script), then wait until an Excel file is updated before continuing. Meanwhile, the following steps are executed.

2. Extract AI news from different sources using Python (currently, the only source available is [AINEWS](https://www.artificialintelligence-news.com/))

3. Clean the data obtained from the news using Python:

    - Format dates and titles so that SharePoint can read them correctly
    - Filter the news by keywords in their titles (e.g. AI, ChatGPT, Gemini, LLMs)
    - Filter the news by date, keeping only items published in the last 2 days

4. Store the AI news obtained in an Excel file using Python; the file is automatically synced to OneDrive

5. PowerAutomate catches the Excel update and checks whether the news items obtained are already stored in a SharePoint list. If so, it discards them; otherwise, it stores them in the SharePoint list


**This notebook executes steps 2, 3, and 4 of the Proposed New (Temporary) Pipeline**

In [169]:
import requests
from bs4 import BeautifulSoup
import feedparser
from datetime import datetime, timezone, timedelta
import pandas as pd
from openpyxl.worksheet.table import Table, TableStyleInfo
from openpyxl.utils import get_column_letter
import os

## News extraction from [AINEWS page](https://www.artificialintelligence-news.com/)

In [170]:
rss_url = "https://www.artificialintelligence-news.com/artificial-intelligence-news/feed/" # link to RSS feed
feed = feedparser.parse(rss_url)

keywords = ["AI", "A.I.", "Artificial Intelligence", "Machine Learning", "Deep Learning", "Neural Networks", "NLP", "Computer Vision", "Data Science", "Gemini", "Bard", "ChatGPT", "GPT-4", "DALL-E", "MidJourney", "Stable Diffusion", "Claude", "LLaMA", "Whisper"]

# Extract relevant data from feed entries
ai_news = [
    {
        "title": entry.title.replace('\'', ''), # Remove single quotes to avoid issues in SharePoint
        "news_link": entry.link,
        "date": entry.published
    }
    for entry in feed.entries if any(keyword.lower() in entry.title.lower() for keyword in keywords)
    ]

ai_news

[{'title': 'China Mobile Shanghai launches industry-first 5G-A network monetisation strategy with Huawei',
  'news_link': 'https://www.artificialintelligence-news.com/news/5g-a-shanghai-huawei-network-monetization-football/',
  'date': 'Fri, 03 Oct 2025 09:00:00 +0000'},
 {'title': 'AI causes reduction in users’ brain activity – MIT',
  'news_link': 'https://www.artificialintelligence-news.com/news/ai-causes-reduction-in-users-brain-activity-mit/',
  'date': 'Wed, 01 Oct 2025 13:44:30 +0000'},
 {'title': 'The 5 best AI AppSec tools in 2025',
  'news_link': 'https://www.artificialintelligence-news.com/news/the-5-best-ai-appsec-tools-in-2025/',
  'date': 'Wed, 01 Oct 2025 12:09:36 +0000'},
 {'title': 'Why AI phishing detection will define cybersecurity in 2026',
  'news_link': 'https://www.artificialintelligence-news.com/news/why-ai-phishing-detection-will-define-cybersecurity-in-2026/',
  'date': 'Wed, 01 Oct 2025 10:07:59 +0000'},
 {'title': 'Google: EU’s AI adoption lags China amid re
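Note that the first result above is a false positive: with plain substring matching, the keyword "AI" is found inside "Shanghai". The following is a minimal sketch of one way to avoid this, assuming whole-word matching is acceptable for every keyword. `title_matches` is a hypothetical helper, and the keyword list is shortened for the example.

```python
import re

keywords = ["AI", "A.I.", "ChatGPT", "Machine Learning"]  # shortened list for illustration

# One case-insensitive pattern; the lookarounds require that a keyword is not
# glued to other letters or digits, so "AI" no longer matches inside "Shanghai"
pattern = re.compile(
    r"(?<![A-Za-z0-9])(?:" + "|".join(re.escape(kw) for kw in keywords) + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def title_matches(title: str) -> bool:
    """Return True if the title contains any keyword as a whole word."""
    return pattern.search(title) is not None


print(title_matches("China Mobile Shanghai launches industry-first 5G-A network"))  # False
print(title_matches("The 5 best AI AppSec tools in 2025"))  # True
```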

Each news item also needs an image, which feedparser does not provide, so the image is extracted directly from the article page.

In [171]:
url = "https://www.artificialintelligence-news.com/news/the-5-best-ai-appsec-tools-in-2025/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
}

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Error fetching page: {e}")
    raise  # Stop here: without a response there is nothing to parse

soup = BeautifulSoup(response.content, 'html.parser')

# Candidate images live inside Elementor widget containers
all_containers = soup.select('.elementor-widget-container')

containers_with_images = [c for c in all_containers if c.find('img')]
# The main article image is the one declared with width="800"
image = [container.find('img').get("src") for container in containers_with_images if container.find('img').get('width') == '800'][0]

image


'https://www.artificialintelligence-news.com/wp-content/uploads/2025/09/Untitled-design-73-1024x573.png'

## Defining functions to extract all the AI-News information

The feed returns dates in RFC 2822 format (e.g. 'Tue, 30 Sep 2025 11:07:47 +0000'), but SharePoint requires them in the format '%Y-%m-%d %H:%M:%SZ'.

In [172]:
def format_date(date_str: str) -> str:
    """
    Convert an RFC 2822 date string to the 'YYYY-MM-DD HH:MM:SSZ' format (UTC).

    Args:
        date_str (str): Date string in RFC 2822 format (e.g., 'Fri, 09 Oct 2020 14:19:00 +0000').
    
    Returns:
        str: Date string in UTC with a 'Z' suffix (e.g., '2020-10-09 14:19:00Z').
    """
    if not isinstance(date_str, str):
        raise ValueError("Input must be a string in the format '%a, %d %b %Y %H:%M:%S %z' (e.g.: 'Fri, 09 Oct 2020 14:19:00 +0000').")
    
    # Parse the RFC 2822 date
    dt = datetime.strptime(date_str, r'%a, %d %b %Y %H:%M:%S %z')
    # Convert to UTC so the 'Z' suffix is accurate, then format
    return dt.astimezone(timezone.utc).strftime(r'%Y-%m-%d %H:%M:%SZ')
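As an alternative sketch, the standard library's `email.utils.parsedate_to_datetime` parses RFC 2822 dates without a hand-written format string. The name `format_date_rfc2822` is hypothetical, and the explicit conversion to UTC is an assumption about what the 'Z' suffix should mean when a feed reports a non-zero offset.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def format_date_rfc2822(date_str: str) -> str:
    """Convert an RFC 2822 date string to 'YYYY-MM-DD HH:MM:SSZ' in UTC."""
    dt = parsedate_to_datetime(date_str)
    # Normalise to UTC so that the trailing 'Z' is accurate for any input offset
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")


print(format_date_rfc2822("Tue, 30 Sep 2025 11:07:47 +0000"))  # 2025-09-30 11:07:47Z
print(format_date_rfc2822("Tue, 30 Sep 2025 11:07:47 +0200"))  # 2025-09-30 09:07:47Z
```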

In [173]:
def extract_news_image(news_url: str) -> str:
    """
    Extract the main image URL from a news article page.

    Args:
        news_url (str): URL of the news article.
    
    Returns:
        str: URL of the main image in the article.

    Raises:
        requests.RequestException: If the page cannot be fetched.
        ValueError: If the input is not a string or no main image is found.
    """
    if not isinstance(news_url, str):
        raise ValueError("Input must be a string representing the news article URL.")
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
    }

    try:
        response = requests.get(news_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching page: {e}")
        raise  # Without a response there is nothing to parse

    soup = BeautifulSoup(response.content, 'html.parser')

    # Candidate images live inside Elementor widget containers
    all_containers = soup.select('.elementor-widget-container')
    containers_with_images = [c for c in all_containers if c.find('img')]

    # The main article image is the one declared with width="800"
    main_images = [container.find('img').get("src") for container in containers_with_images if container.find('img').get('width') == '800']
    if not main_images:
        raise ValueError(f"No main image found in {news_url}")
    return main_images[0]
    

In [174]:
def retrieve_ai_news(url: str) -> list[dict]:
    """
    Retrieve AI news from the given RSS feed URL and process the data.

    Args:
        url (str): URL of the RSS feed.

    Returns:
        list[dict]: List of dictionaries containing news details.    
    """
    if not isinstance(url, str):
        raise ValueError("Input must be a string representing the RSS feed URL.")
    feed = feedparser.parse(url)  # Use the function argument, not the global rss_url

    # Extract relevant data from feed entries
    ai_news = [
        {
            "title": entry.title.replace('\'', ''), # Remove single quotes to avoid issues in SharePoint
            "news_link": entry.link,
            "publish_date": format_date(entry.published),
            "image_link": extract_news_image(entry.link)
        }
        for entry in feed.entries
        ]

    return ai_news

In [175]:
ai_news_formatted = retrieve_ai_news(rss_url)

ai_news_formatted

[{'title': 'China Mobile Shanghai launches industry-first 5G-A network monetisation strategy with Huawei',
  'news_link': 'https://www.artificialintelligence-news.com/news/5g-a-shanghai-huawei-network-monetization-football/',
  'publish_date': '2025-10-03 09:00:00Z',
  'image_link': 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/a92e94de-ee49-450d-93f4-e7507d176e0f.jpg.1200.800-1024x683.jpg'},
 {'title': 'AI causes reduction in users’ brain activity – MIT',
  'news_link': 'https://www.artificialintelligence-news.com/news/ai-causes-reduction-in-users-brain-activity-mit/',
  'publish_date': '2025-10-01 13:44:30Z',
  'image_link': 'https://www.artificialintelligence-news.com/wp-content/uploads/2025/10/ai-cognitive-hero-1024x683.jpg'},
 {'title': 'The 5 best AI AppSec tools in 2025',
  'news_link': 'https://www.artificialintelligence-news.com/news/the-5-best-ai-appsec-tools-in-2025/',
  'publish_date': '2025-10-01 12:09:36Z',
  'image_link': 'https://www.artificial

The list of dictionaries is converted into a pandas DataFrame so that it can be saved as an Excel file.

In [176]:
ai_news_df = pd.DataFrame(ai_news_formatted)

In [177]:
ai_news_df.head()

| | title | news_link | publish_date | image_link |
|---|---|---|---|---|
| 0 | China Mobile Shanghai launches industry-first ... | https://www.artificialintelligence-news.com/ne... | 2025-10-03 09:00:00Z | https://www.artificialintelligence-news.com/wp... |
| 1 | AI causes reduction in users’ brain activity –... | https://www.artificialintelligence-news.com/ne... | 2025-10-01 13:44:30Z | https://www.artificialintelligence-news.com/wp... |
| 2 | The 5 best AI AppSec tools in 2025 | https://www.artificialintelligence-news.com/ne... | 2025-10-01 12:09:36Z | https://www.artificialintelligence-news.com/wp... |
| 3 | Why AI phishing detection will define cybersec... | https://www.artificialintelligence-news.com/ne... | 2025-10-01 10:07:59Z | https://www.artificialintelligence-news.com/wp... |
| 4 | Google: EU’s AI adoption lags China amid regul... | https://www.artificialintelligence-news.com/ne... | 2025-10-01 09:54:47Z | https://www.artificialintelligence-news.com/wp... |


## Filter the data by date (get the news of the last two days)

Check whether the `publish_date` column has a datetime data type:

In [178]:
ai_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         12 non-null     object
 1   news_link     12 non-null     object
 2   publish_date  12 non-null     object
 3   image_link    12 non-null     object
dtypes: object(4)
memory usage: 516.0+ bytes


In [179]:
ai_news_df.publish_date = pd.to_datetime(ai_news_df.publish_date, format=r'%Y-%m-%d %H:%M:%SZ', utc=True)

In [180]:
ai_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   title         12 non-null     object             
 1   news_link     12 non-null     object             
 2   publish_date  12 non-null     datetime64[ns, UTC]
 3   image_link    12 non-null     object             
dtypes: datetime64[ns, UTC](1), object(3)
memory usage: 516.0+ bytes


Get the current time in the Mexico City timezone (UTC-6):

In [181]:
mexico_city_timezone = timezone(timedelta(hours=-6))

now_utc = datetime.now(timezone.utc)

mexico_city_time = now_utc.astimezone(mexico_city_timezone)

mexico_city_time

datetime.datetime(2025, 10, 3, 23, 31, 16, 715978, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=64800)))

In [182]:
ai_news_df_filtered = ai_news_df[ai_news_df.publish_date >= mexico_city_time - timedelta(days=2)]

print(f"Newest date: {ai_news_df_filtered.publish_date.max()}\nOldest date: {ai_news_df_filtered.publish_date.min()}")

Newest date: 2025-10-03 09:00:00+00:00
Oldest date: 2025-10-03 09:00:00+00:00
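Because both sides of the comparison above are timezone-aware, the cutoff identifies the same instant whether it is expressed in Mexico City time or in UTC. The sketch below illustrates that equivalence with a fixed timestamp instead of `datetime.now`, so the result is reproducible. The `cutoff_*` names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

mexico_city_timezone = timezone(timedelta(hours=-6))

# Fixed instant standing in for datetime.now(timezone.utc)
now_utc = datetime(2025, 10, 4, 5, 31, 16, tzinfo=timezone.utc)
now_mexico = now_utc.astimezone(mexico_city_timezone)

cutoff_utc = now_utc - timedelta(days=2)
cutoff_mexico = now_mexico - timedelta(days=2)

# Aware datetimes compare by absolute instant, not by wall-clock value
print(now_mexico.isoformat())       # 2025-10-03T23:31:16-06:00
print(cutoff_utc == cutoff_mexico)  # True
```

In other words, the local timezone matters for how the time is displayed, not for which rows pass the two-day filter.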


In [183]:
ai_news_df_filtered.head()

| | title | news_link | publish_date | image_link |
|---|---|---|---|---|
| 0 | China Mobile Shanghai launches industry-first ... | https://www.artificialintelligence-news.com/ne... | 2025-10-03 09:00:00+00:00 | https://www.artificialintelligence-news.com/wp... |


After the date filtering, `publish_date` must be converted back to a string, because Excel does not support timezone-aware datetimes.

In [184]:
# Work on an explicit copy of the filtered slice to avoid pandas' SettingWithCopyWarning
ai_news_df_filtered = ai_news_df_filtered.copy()

# The string dates go into a new column, publish_date_str, because writing them into the
# datetime column raised a FutureWarning: Setting an item of incompatible dtype is deprecated
ai_news_df_filtered["publish_date_str"] = ai_news_df_filtered.publish_date.dt.strftime(r'%Y-%m-%d %H:%M:%SZ')

ai_news_df_filtered = ai_news_df_filtered.drop(columns="publish_date")

ai_news_df_filtered = ai_news_df_filtered.rename(columns={"publish_date_str": "publish_date"})

ai_news_df_filtered.head()


| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | China Mobile Shanghai launches industry-first ... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-03 09:00:00Z |


## Store the data obtained in an Excel file

In [185]:
local_file_path = "../data/ai-news.xlsx"  # Path to save the Excel file

file_name = local_file_path.split("/")[-1]
path = '/'.join(local_file_path.split("/")[:-1])

os.path.exists(path)

True
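The path handling above can also be written with `pathlib`, which avoids splitting on "/" by hand. This is a minimal sketch: creating the missing folder with `mkdir` is an addition of this example (the cell above only checks that the folder exists), and a temporary directory is used so the example does not touch real data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for "../data/ai-news.xlsx"
    local_file_path = Path(tmp_dir) / "data" / "ai-news.xlsx"

    file_name = local_file_path.name   # "ai-news.xlsx"
    folder = local_file_path.parent    # ".../data"

    # Create the target folder if it does not exist yet
    folder.mkdir(parents=True, exist_ok=True)
    folder_exists = folder.exists()

print(file_name, folder_exists)  # ai-news.xlsx True
```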

In [187]:
with pd.ExcelWriter(local_file_path, engine='openpyxl') as writer:
    ai_news_df_filtered.to_excel(writer, index=False, sheet_name='AI-News')
    worksheet = writer.sheets["AI-News"]
    (max_row, max_col) = ai_news_df_filtered.shape

    # Compute the table range in Excel notation (for example, "A1:D10")
    table_ref = f"A1:{get_column_letter(max_col)}{max_row + 1}"

    table = Table(displayName="RecentAINews", ref=table_ref)
    style = TableStyleInfo(name="TableStyleMedium9", showFirstColumn=False,
                           showLastColumn=False, showRowStripes=True, showColumnStripes=False)
    table.tableStyleInfo = style
    worksheet.add_table(table)

# Testing Final Pipeline Steps

In [6]:
import sys

sys.path.append("..")

from ai_news_pipeline.ai_news_pipeline_steps import retrieve_ai_news, filter_news_by_date
from ai_news_pipeline.config import AINewsConfig

In [2]:
news_config = AINewsConfig()

In [3]:
news_config.CASE_SEN_SEARCH_KW

[' AI ', 'AI ', 'AI ', 'A.I.', ' AI-', 'AI-']

In [7]:
text=["China Mobile Shanghai launches industry-first 5G-A network",]

In [8]:
[
    x for x in text
    if any(kw in x for kw in news_config.CASE_SEN_SEARCH_KW)
    or any(kw.lower() in x.lower() for kw in news_config.CASE_INSEN_SEARCH_KW)
]

[]

In [12]:
ai_news = retrieve_ai_news(
    url=news_config.NEWS_URL,
    case_insen_search_kw=news_config.CASE_INSEN_SEARCH_KW,
    case_sen_search_kw=news_config.CASE_SEN_SEARCH_KW,
)

2025-10-06 10:01:21.041 | INFO     | ai_news_pipeline.ai_news_pipeline_steps:retrieve_ai_news:38 - Fetching RSS feed...
2025-10-06 10:01:22.142 | INFO     | ai_news_pipeline.ai_news_pipeline_steps:retrieve_ai_news:45 - Filtering news articles based on keywords: 
 AI 
AI 
AI 
A.I.
 AI-
AI-
Artificial Intelligence
Machine Learning
Deep Learning
Neural Networks
NLP
Computer Vision
Data Science
Gemini
Bard
ChatGPT
GPT-4
DALL-E
MidJourney
Stable Diffusion
Claude
LLaMA
Whisper
2025-10-06 10:01:35.247 | INFO     | ai_news_pipeline.ai_news_pipeline_steps:retrieve_ai_news:62 - News found:

2025-10-06 10:01:35.247 | INFO     | ai_news_pipeline.ai_news_pipeline_steps:retrieve_ai_news:63 - Google’s new AI agent rewrites code to automate vulnerability fixes
AI causes reduction in users’ brain activ

In [8]:
ai_news

| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | Google’s new AI agent rewrites code to automat... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-06 13:56:40Z |
| 1 | AI causes reduction in users’ brain activity –... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-01 13:44:30Z |
| 2 | The 5 best AI AppSec tools in 2025 | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-01 12:09:36Z |
| 3 | Why AI phishing detection will define cybersec... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-01 10:07:59Z |
| 4 | Google: EU’s AI adoption lags China amid regul... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-01 09:54:47Z |
| 5 | The value gap from AI investments is widening ... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-09-30 12:35:19Z |
| 6 | The rise of algorithmic agriculture? AI steps in | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-09-30 11:07:47Z |
| 7 | Rising AI demands push Asia Pacific data centr... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-09-30 08:15:55Z |
| 8 | Reply’s pre-built AI apps aim to fast-track AI... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-09-30 08:13:03Z |
| 9 | Huawei details open-source AI development road... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-09-29 08:34:34Z |


In [11]:
filter_news_by_date(ai_news, date_column=news_config.DATE_COLUMN, days=news_config.DAYS_BACK)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.loc[:, f"{date_column}_str"] = filtered_df[date_column].dt.strftime(
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.drop(date_column, axis=1, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df.rename(columns={f"{date_column}_str": date_column}, inplace=True)


| | title | news_link | image_link | publish_date |
|---|---|---|---|---|
| 0 | Google’s new AI agent rewrites code to automat... | https://www.artificialintelligence-news.com/ne... | https://www.artificialintelligence-news.com/wp... | 2025-10-06 13:56:40Z |
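
The pandas warnings printed by this last call are raised when a filtered slice of a DataFrame is modified in place. The source of `filter_news_by_date` is not shown in this notebook, so the following is only a hypothetical stand-in with a similar call signature, assuming the date column holds strings in the '%Y-%m-%d %H:%M:%SZ' format. The optional `now` parameter is an addition for reproducibility. It returns an explicit copy instead of mutating the slice, which should not trigger the warning.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

DATE_FORMAT = "%Y-%m-%d %H:%M:%SZ"


def filter_news_by_date_sketch(df, date_column, days, now=None):
    """Keep rows whose date falls within the last `days` days (hypothetical stand-in)."""
    now = now or datetime.now(timezone.utc)
    # Parse into a separate Series so the original string column is left untouched
    dates = pd.to_datetime(df[date_column], format=DATE_FORMAT, utc=True)
    # Boolean indexing followed by .copy() yields an independent DataFrame
    recent = df[dates >= now - timedelta(days=days)].copy()
    return recent.reset_index(drop=True)


news = pd.DataFrame(
    {
        "title": ["Recent item", "Old item"],
        "publish_date": ["2025-10-06 13:56:40Z", "2025-10-01 13:44:30Z"],
    }
)
fixed_now = datetime(2025, 10, 6, 16, 0, 0, tzinfo=timezone.utc)
result = filter_news_by_date_sketch(news, "publish_date", days=2, now=fixed_now)
print(result)
```

Because the dates are parsed into a separate Series, the string column never changes dtype and needs no conversion back before writing to Excel.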
