# RSS News Feeds Analysis Notebook

This notebook pulls multiple RSS feeds, aggregates the entries, and performs basic analysis and visualization.  
To gather a dataset, we plan to use a GitHub Action that runs on a schedule and stores the news as JSON.  
This will allow us to compare and analyze trends across media outlets.
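
One possible shape for the storage step, as a minimal sketch: each run merges the freshly parsed entries into a JSON archive and skips links that are already stored. The file name `news_archive.json` and the helpers `merge_entries` and `update_archive` are assumptions made for illustration; they are not part of this notebook or of any library.

```python
import json
from pathlib import Path


def merge_entries(existing, new):
    """Return existing + new entries, keeping the first entry seen per link."""
    seen = {e["link"] for e in existing}
    merged = list(existing)
    for entry in new:
        if entry["link"] not in seen:
            seen.add(entry["link"])
            merged.append(entry)
    return merged


def update_archive(path, new_entries):
    """Load the JSON archive at `path` (if it exists), merge, and write it back."""
    path = Path(path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    merged = merge_entries(existing, new_entries)
    path.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    return merged
```

With the `entries` list built in the next cell (dictionaries of plain strings), a run could call `update_archive("news_archive.json", entries)`. Deduplicating on `link` is a design choice; a feed that republishes a story under a new URL would still be stored twice.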

In [19]:
import feedparser
import pandas as pd
import matplotlib.pyplot as plt

# List your RSS feed URLs here
feed_urls = [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://reutersbest.com/feed/",
    "https://www.euronews.com/rss",
    "https://www.wired.com/feed/rss",
    "https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en",
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
]

# List to hold all the feed entries
entries = []

# Pull and parse each feed
for url in feed_urls:
    try:
        feed = feedparser.parse(url)
        if 'title' not in feed.feed:
            print(f"Skipping feed {url} as it does not contain a title.")
            continue
        source_title = feed.feed.get('title', 'Unknown Source')
        for entry in feed.entries:
            entries.append({
                'source': source_title,
                'title': entry.get('title', ''),
                'published': entry.get('published', entry.get('updated', entry.get('pubDate', ''))),
                'link': entry.get('link', ''),
                'summary': entry.get('summary', entry.get('description', ''))
            })
    except Exception as e:
        print(f"Error parsing feed {url}: {e}")

# Create a DataFrame from the entries
df = pd.DataFrame(entries)

# Convert the published date to a timezone-aware datetime (UTC) so feeds with
# different UTC offsets are comparable; unparseable values become NaT
df['published'] = pd.to_datetime(df['published'], errors='coerce', utc=True)

# Display a sample of the data
print('Sample entries:')
display(df.head())

## Basic Analysis and Statistics

We now perform some basic analysis on the aggregated data.

In [20]:
# 1. Count of news items per source
news_count = df['source'].value_counts()
print('News count per source:')
display(news_count)

# Word count of each news item: words in the title plus words in the summary
df['word_count'] = df['title'].str.split().str.len() + df['summary'].str.split().str.len()
# Visualize the word count distribution
df['word_count'].plot(kind='hist', bins=20)
plt.title('Word count distribution')
plt.xlabel('Word count')
plt.ylabel('Frequency')
plt.show()



## Visualizations

- A bar chart showing the number of news items per source.

In [21]:
# Plot: Number of News Items per Source
plt.figure(figsize=(10, 5))
news_count.plot(kind='bar')
plt.title('Number of News Items per Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

In [26]:
%pip install nltk scikit-learn seaborn

Next, let's look at the similarity between article titles using cosine similarity.  
This may give us insight into how the articles relate to each other.
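
To make the measure concrete, here is a minimal sketch of cosine similarity between two short texts. It is a simplification: it uses raw word counts instead of the TF-IDF weights used in the cell below, and the function name `cosine_similarity_counts` and the example headlines are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between two texts, using raw word counts as vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


same_words = cosine_similarity_counts("oil prices rise", "oil prices fall")
no_overlap = cosine_similarity_counts("oil prices rise", "crude costs climb")
print(round(same_words, 3), no_overlap)  # prints: 0.667 0.0
```

The second pair scores 0 even though both headlines describe the same kind of event: a purely lexical measure only rewards shared words.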
 


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import re
from collections import Counter
from nltk.corpus import stopwords
import nltk

# Download the stopwords corpus if it is not already present
nltk.download('stopwords')

# Prepare English stopwords
stop_words = set(stopwords.words('english'))

# Drop duplicates and null titles
df_titles = df[['title']].dropna().drop_duplicates().reset_index(drop=True)

# Basic text cleaning function
def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text.lower())  # Remove punctuation and lowercase
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

df_titles['tokens'] = df_titles['title'].apply(preprocess)

# Flatten list of tokens to get word frequencies
all_tokens = [token for sublist in df_titles['tokens'] for token in sublist]
token_freq = Counter(all_tokens)

# Filter out infrequent words
df_titles['filtered_tokens'] = df_titles['tokens'].apply(
    lambda tokens: [t for t in tokens if token_freq[t] > 1]
)

# Convert token lists back to strings
df_titles['cleaned_title'] = df_titles['filtered_tokens'].apply(lambda tokens: ' '.join(tokens))

# TF-IDF Vectorization on cleaned titles
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df_titles['cleaned_title'])

# Compute similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

# Plot the similarity heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(cosine_sim, cmap='viridis')
plt.title("Cosine Similarity Between Cleaned Article Titles")
plt.xlabel("Article Index")
plt.ylabel("Article Index")
plt.tight_layout()
plt.show()
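
A large heatmap can be hard to read, so one way to inspect the matrix is to list the highest-scoring pairs directly. The helper below is a hypothetical sketch (the name `top_similar_pairs` is not from any library); it accepts any square matrix indexable as `m[i][j]`, which includes the NumPy array returned by `cosine_similarity`.

```python
def top_similar_pairs(sim_matrix, labels, k=5):
    """Return the k highest-scoring (score, label_i, label_j) pairs with i < j."""
    pairs = []
    n = len(labels)
    for i in range(n):
        # Upper triangle only: skips self-pairs and mirrored duplicates
        for j in range(i + 1, n):
            pairs.append((float(sim_matrix[i][j]), labels[i], labels[j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]


# Toy 3x3 similarity matrix, for illustration only
toy = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
print(top_similar_pairs(toy, ["A", "B", "C"], k=2))
# prints: [(0.8, 'A', 'B'), (0.3, 'B', 'C')]
```

In this notebook the call would look like `top_similar_pairs(cosine_sim, df_titles['title'].tolist())`. The double loop is quadratic in the number of titles, which is acceptable for a few hundred entries.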


As the heatmap shows, the results are quite disappointing, but this is expected:  
news from different outlets may use quite different wording for the same story.  
To counter this, we can use an embedding model.  
There is a neat library called [BERTopic](https://maartengr.github.io/BERTopic/index.html) that does this.  
Here is a sneak peek of what it can do:

In [None]:
%pip install bertopic

In [None]:
from bertopic import BERTopic

# Use the same cleaned titles from the previous step
titles_cleaned = df_titles['cleaned_title'].tolist()

# Create and fit BERTopic model
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(titles_cleaned)

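# Visualize topics in different ways. Each call returns a Plotly figure; call
# .show() explicitly, since a cell only displays the value of its last expression
topic_model.visualize_topics().show()

topic_model.visualize_barchart(top_n_topics=10).show()

topic_model.visualize_heatmap().show()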
