# 📰 News Trend Analysis - Technical Documentation

**Author**: David Josipović  
**Course**: PI (Poslovna Inteligencija, Business Intelligence) & MOPJ (Metode Obrade Prirodnog Jezika, Natural Language Processing Methods)  
**Date**: November 2025

---

## Table of Contents
1. [Business Problem](#1-business-problem)
2. [Data Understanding](#2-data-understanding)
3. [Data Preparation](#3-data-preparation)
4. [Modeling](#4-modeling)
5. [Evaluation](#5-evaluation)
6. [Results & Business Insights](#6-results--business-insights)
7. [Deployment](#7-deployment)
8. [Conclusion](#8-conclusion)

## 1. Business Problem

### Problem Statement
Manual tracking of economic news sentiment is:
- **Time-consuming**: Reading hundreds of articles daily
- **Subjective**: Different analysts interpret the same article differently
- **Not scalable**: Cannot process multiple sources simultaneously

### Proposed Solution
Automated AI-powered pipeline that:
1. **Fetches** economic news every 12 hours (NewsData.io API)
2. **Analyzes** sentiment using FinBERT (76% average prediction confidence)
3. **Discovers** trending topics with BERTopic (unsupervised)
4. **Summarizes** articles with DistilBART (37.7x compression)
5. **Visualizes** insights in interactive Streamlit dashboard

### Business Value
- 📈 **Investors**: Real-time market sentiment tracking
- 📰 **Media Organizations**: Identify trending narratives
- 📊 **Financial Analysts**: Automated research assistance
- 🏢 **Businesses**: Monitor industry trends and competitor news

## 2. Data Understanding

### 2.1 Data Source
- **API**: NewsData.io (free tier)
- **Query**: `"economy OR business OR finance"`
- **Language**: English
- **Update Frequency**: 2x daily (8:00 & 20:00 UTC)
- **Time Period**: Last 7 days (rolling window)
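
The settings above map onto a single HTTP GET request. The sketch below only builds the request URL and sends nothing. The endpoint path and the parameter names (`apikey`, `q`, `language`) are assumptions that should be checked against the current NewsData.io API reference; `build_news_url` is a hypothetical helper and the key is a placeholder.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint; verify against the NewsData.io API reference.
NEWSDATA_ENDPOINT = "https://newsdata.io/api/1/news"


def build_news_url(api_key: str,
                   query: str = "economy OR business OR finance",
                   language: str = "en") -> str:
    """Build the URL for one fetch (hypothetical helper, assumed parameter names)."""
    params = {"apikey": api_key, "q": query, "language": language}
    return f"{NEWSDATA_ENDPOINT}?{urlencode(params)}"


url = build_news_url("YOUR_API_KEY")  # placeholder key
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect and test without spending free-tier requests.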

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
df = pd.read_csv('data/processed/articles_with_summary.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

### 2.2 Exploratory Data Analysis

In [None]:
# Basic statistics
print("=== DATASET STATISTICS ===")
print(f"Total articles: {len(df)}")
print(f"Unique titles: {df['title'].nunique()}")
print(f"Unique sources: {df['source'].nunique()}")
print(f"\nDate range:")
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
print(f"From: {df['publishedAt'].min()}")
print(f"To: {df['publishedAt'].max()}")
print(f"\nMissing values:")
print(df.isnull().sum())

In [None]:
# Text length distribution
df['text_length'] = df['text'].str.len()

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df['text_length'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(df['text_length'].mean(), color='red', linestyle='--', 
                label=f'Mean: {df["text_length"].mean():.0f} chars')
axes[0].set_title('Distribution of Article Lengths')
axes[0].set_xlabel('Number of Characters')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Boxplot
axes[1].boxplot(df['text_length'])
axes[1].set_title('Article Length Boxplot')
axes[1].set_ylabel('Number of Characters')

plt.tight_layout()
plt.show()

print(f"\nText Length Statistics:")
print(df['text_length'].describe())

In [None]:
# Source distribution
plt.figure(figsize=(14, 6))
top_sources = df['source'].value_counts().head(15)
top_sources.plot(kind='barh')
plt.title('Top 15 News Sources')
plt.xlabel('Number of Articles')
plt.ylabel('Source')
plt.tight_layout()
plt.show()

print(f"\nTotal unique sources: {df['source'].nunique()}")
print(f"Top 5 sources:\n{df['source'].value_counts().head()}")

## 3. Data Preparation

### 3.1 Data Cleaning Process

**Pipeline Steps:**
1. ✅ **Fetch**: NewsData.io API (70 raw articles)
2. ✅ **Scrape**: Extract full content (newspaper3k)
3. ✅ **Preprocess**: Clean and filter
   - Remove articles < 200 words
   - Remove paywall content
   - Normalize text (lowercase, strip)
4. ✅ **Result**: 55 articles retained (78.6%)
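
The preprocessing step above could look roughly like the following. This is a minimal sketch, not the project's actual code: the 200-word threshold comes from the list above, while the paywall phrases and the helper names `normalize` and `keep_article` are hypothetical.

```python
import re

MIN_WORDS = 200  # threshold stated in the pipeline steps
# Hypothetical paywall markers; the real pipeline's list is not shown here.
PAYWALL_MARKERS = ("subscribe to continue", "sign in to read")


def normalize(text: str) -> str:
    """Collapse whitespace, strip the ends, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def keep_article(text: str) -> bool:
    """Return True if the article passes the length and paywall filters."""
    clean = normalize(text)
    if len(clean.split()) < MIN_WORDS:
        return False
    return not any(marker in clean for marker in PAYWALL_MARKERS)


articles = [
    "word " * 250,                            # long enough, kept
    "too short",                              # under 200 words, dropped
    "word " * 250 + "Subscribe to continue",  # paywall marker, dropped
]
kept = [normalize(a) for a in articles if keep_article(a)]
print(f"{len(kept)} of {len(articles)} articles retained")
```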

In [None]:
# Data quality metrics
print("=== DATA QUALITY REPORT ===")
print(f"\n1. Collection Phase:")
print(f"   Raw articles fetched: 70")
print(f"\n2. Preprocessing Phase:")
print(f"   Articles retained: 55 (78.6%)")
print(f"   Articles removed: 15 (21.4%)")
print(f"\n3. Removal Reasons:")
print(f"   - Insufficient content (< 200 words): ~10")
print(f"   - Paywall articles: ~5")
print(f"\n4. Final Dataset:")
print(f"   Total articles: {len(df)}")
print(f"   Unique articles: {df['title'].nunique()}")
print(f"   Avg length: {df['text_length'].mean():.0f} characters")
print(f"   Avg words: {df['text'].str.split().str.len().mean():.0f} words")

### 3.2 Feature Engineering

Created new features through ML models:
- `sentiment`: Categorical (positive, neutral, negative)
- `sentiment_confidence`: Float (0-1 confidence score)
- `topic`: Integer (cluster ID from BERTopic)
- `topic_label`: String (auto-generated descriptive label)
- `summary`: String (AI-generated abstractive summary)
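
To illustrate how `sentiment` and `sentiment_confidence` relate, the sketch below turns raw classifier scores (logits) into a label and a confidence value. It assumes the confidence is the highest softmax probability, which is a common convention but is not confirmed by this notebook. The label order is also an assumption for illustration; in practice it should be read from the model's configuration.

```python
import math

# Assumed label order, for illustration only.
LABELS = ("positive", "negative", "neutral")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_features(logits):
    """Return (sentiment, sentiment_confidence) for one article."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


sentiment, confidence = to_features([0.2, 2.1, 0.4])  # made-up logits
print(sentiment, f"{confidence:.1%}")
```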

In [None]:
# Show feature types
print("=== ENGINEERED FEATURES ===")
print(f"\nOriginal features: title, text, publishedAt, source")
print(f"\nML-generated features:")
for col in ['sentiment', 'sentiment_confidence', 'topic', 'topic_label', 'summary']:
    if col in df.columns:
        print(f"  ✅ {col}: {df[col].dtype}")
    else:
        print(f"  ❌ {col}: Missing")

# Sample engineered features
print(f"\nSample article with all features:")
sample = df.iloc[0]
print(f"Title: {sample['title'][:80]}...")
print(f"Sentiment: {sample['sentiment']} (confidence: {sample['sentiment_confidence']:.1%})")
print(f"Topic: {sample['topic_label']}")
print(f"Summary: {sample['summary'][:100]}...")

## 4. Modeling

### 4.1 Model Selection Rationale

#### Task 1: Sentiment Analysis
**Selected Model**: FinBERT (`ProsusAI/finbert`)

**Why FinBERT?**
- Further pre-trained on financial text and fine-tuned for sentiment on the Financial PhraseBank dataset (per the model card)
- Understands economic terminology ("rate cut", "inflation", "growth")
- Better than Twitter-RoBERTa for formal news

**Comparison:**
```
Model                 | Mean confidence | Articles labeled negative
----------------------|-----------------|--------------------------
Twitter-RoBERTa       | 68.1%           | 1.8% ❌
FinBERT (selected)    | 76.0%           | 18.2% ✅
```
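
Both figures in the comparison can be derived directly from a model's predictions. The following is a minimal sketch, assuming the first figure is the mean of per-article confidence scores and the second is the share of articles labeled `negative`. `summarize_predictions` and the toy data are hypothetical.

```python
def summarize_predictions(predictions):
    """predictions: list of (label, confidence) pairs produced by one model."""
    n = len(predictions)
    mean_conf = sum(conf for _, conf in predictions) / n
    negative_share = sum(1 for label, _ in predictions if label == "negative") / n
    return mean_conf, negative_share


# Made-up toy predictions, not the project's data.
toy = [("negative", 0.9), ("neutral", 0.7), ("positive", 0.8), ("neutral", 0.6)]
mean_conf, negative_share = summarize_predictions(toy)
print(f"mean confidence {mean_conf:.1%}, negative share {negative_share:.1%}")
```

Running the same function over each model's output on the same articles gives one comparable row per model.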

#### Task 2: Topic Modeling
**Selected Model**: BERTopic (unsupervised)

**Why BERTopic?**
- No labeled data required (unsupervised)
- Automatic topic discovery
- Components: UMAP + HDBSCAN + KeyBERT
- Dynamic: Adapts as dataset grows

#### Task 3: Summarization
**Selected Model**: DistilBART (`sshleifer/distilbart-cnn-12-6`)

**Why DistilBART?**
- Trained on CNN/DailyMail news articles
- Abstractive (generates new sentences, not extractive)
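- Distilled BART (12 encoder / 6 decoder layers, so smaller and faster than `facebook/bart-large-cnn`, with similar ROUGE scores per the model card)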
- Efficient on CPU

### 4.2 Sentiment Analysis Results

In [None]:
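# Sentiment distribution (fixed order so the colors below match their labels;
# value_counts() alone sorts by frequency, not by label)
sentiment_order = ['negative', 'neutral', 'positive']
sentiment_counts = df['sentiment'].value_counts().reindex(sentiment_order, fill_value=0)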
sentiment_pct = (sentiment_counts / len(df) * 100).round(1)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Pie chart
colors = ['#e74c3c', '#95a5a6', '#2ecc71']  # negative, neutral, positive
axes[0].pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%',
            colors=colors, startangle=90)
axes[0].set_title('Sentiment Distribution')

# Bar chart
sentiment_counts.plot(kind='bar', ax=axes[1], color=colors)
axes[1].set_title('Article Count by Sentiment')
axes[1].set_xlabel('Sentiment')
axes[1].set_ylabel('Number of Articles')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)

plt.tight_layout()
plt.show()

print("=== SENTIMENT ANALYSIS RESULTS ===")
print(f"\nDistribution:")
for sent, count in sentiment_counts.items():
    emoji = {'negative': '📉', 'neutral': '⚪', 'positive': '📈'}[sent]
    print(f"  {emoji} {sent.capitalize():8} : {count:2} articles ({sentiment_pct[sent]:.1f}%)")
print(f"\nAverage Confidence: {df['sentiment_confidence'].mean():.1%}")

In [None]:
# Confidence distribution
plt.figure(figsize=(12, 5))
plt.hist(df['sentiment_confidence'], bins=20, edgecolor='black', alpha=0.7)
plt.axvline(df['sentiment_confidence'].mean(), color='red', linestyle='--',
            linewidth=2, label=f'Mean: {df["sentiment_confidence"].mean():.1%}')
plt.axvline(0.7, color='orange', linestyle=':', linewidth=2, 
            label='Confidence threshold (70%)')
plt.title('Sentiment Confidence Distribution')
plt.xlabel('Confidence Score')
plt.ylabel('Frequency')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Confidence statistics
print("\n=== CONFIDENCE STATISTICS ===")
print(df['sentiment_confidence'].describe())
print(f"\nLow confidence (< 0.7): {(df['sentiment_confidence'] < 0.7).sum()} articles")
print(f"High confidence (≥ 0.8): {(df['sentiment_confidence'] >= 0.8).sum()} articles")

In [None]:
# Top confident predictions per sentiment
print("=== TOP 3 MOST CONFIDENT PREDICTIONS ===")
for sentiment in ['negative', 'neutral', 'positive']:
    print(f"\n{sentiment.upper()}:")
    top3 = df[df['sentiment'] == sentiment].nlargest(3, 'sentiment_confidence')
    for idx, row in top3.iterrows():
        print(f"  {row['sentiment_confidence']:.1%} - {row['title'][:70]}...")

### 4.3 Topic Modeling Results

In [None]:
# Topic distribution
topic_counts = df['topic_label'].value_counts()

plt.figure(figsize=(14, 6))
topic_counts.plot(kind='barh', color='steelblue')
plt.title('Articles per Topic', fontsize=14, fontweight='bold')
plt.xlabel('Number of Articles')
plt.ylabel('Topic')
plt.tight_layout()
plt.show()

print("=== TOPIC MODELING RESULTS ===")
print(f"\nTotal topics discovered: {df['topic_label'].nunique()}")
print(f"\nTopic Distribution:")
for i, (topic, count) in enumerate(topic_counts.items(), 1):
    pct = count / len(df) * 100
    print(f"  {i}. {topic:30} - {count:2} articles ({pct:.1f}%)")

### 4.4 Summarization Results

In [None]:
# Summary statistics
df['summary_length'] = df['summary'].str.len()
df['summary_words'] = df['summary'].str.split().str.len()
df['compression_ratio'] = df['text_length'] / df['summary_length']

print("=== SUMMARIZATION PERFORMANCE ===")
print(f"\nCoverage: {df['summary'].notna().sum()} / {len(df)} articles ({df['summary'].notna().mean():.1%})")
print(f"\nSummary Length:")
print(f"  Characters: {df['summary_length'].mean():.0f} avg")
print(f"  Words: {df['summary_words'].mean():.0f} avg")
print(f"  Range: {df['summary_length'].min():.0f} - {df['summary_length'].max():.0f} chars")
print(f"\nCompression Ratio: {df['compression_ratio'].mean():.1f}x")
print(f"  (Original ~{df['text_length'].mean():.0f} → Summary ~{df['summary_length'].mean():.0f} chars)")
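
Note that the compression ratio printed above is the mean of the per-article ratios, which generally differs from dividing the average text length by the average summary length shown on the line after it. A small worked example with made-up numbers shows the gap:

```python
# Toy (text_length, summary_length) pairs in characters; not project data.
pairs = [(1000, 100), (9000, 300)]

# Mean of per-article ratios: what df['compression_ratio'].mean() computes.
ratios = [text / summary for text, summary in pairs]
mean_of_ratios = sum(ratios) / len(ratios)

# Ratio of the two averages: what the "Original → Summary" line suggests.
mean_text = sum(text for text, _ in pairs) / len(pairs)
mean_summary = sum(summary for _, summary in pairs) / len(pairs)
ratio_of_means = mean_text / mean_summary

print(f"mean of ratios: {mean_of_ratios:.1f}x, ratio of means: {ratio_of_means:.1f}x")
```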

In [None]:
# Summary length distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df['summary_words'], bins=20, edgecolor='black', alpha=0.7, color='coral')
axes[0].axvline(df['summary_words'].mean(), color='red', linestyle='--',
                label=f'Mean: {df["summary_words"].mean():.0f} words')
axes[0].set_title('Summary Word Count Distribution')
axes[0].set_xlabel('Number of Words')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Scatter: Original vs Summary length
axes[1].scatter(df['text_length'], df['summary_length'], alpha=0.5)
axes[1].set_title('Original vs Summary Length')
axes[1].set_xlabel('Original Text Length (characters)')
axes[1].set_ylabel('Summary Length (characters)')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Sample summaries
print("=== SAMPLE SUMMARIES ===")
for i in range(3):
    row = df.iloc[i]
    print(f"\n{i+1}. {row['title'][:70]}...")
    print(f"   Original: {len(row['text'])} chars, ~{len(row['text'].split())} words")
    print(f"   Summary: {row['summary']}")
    print(f"   Compression: {row['compression_ratio']:.1f}x")

## 5. Evaluation

### 5.1 Comprehensive Pipeline Evaluation

**Evaluation System**: 6 metrics, 4 of which are weighted into the overall score
1. Data Quality (30%): Completeness, text length
2. Sentiment Balance (15%): Distribution analysis
3. Topic Quality (25%): Coherence, cluster size
4. Summarization (30%): Coverage, compression
5. Temporal Analysis (unweighted): Date range, frequency
6. Confidence Tracking (unweighted): Prediction reliability
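
The four weights listed above sum to 100%, so one plausible reading is that the overall score is a weighted sum of the four component scores. The sketch below works through that arithmetic. The formula is an assumption, the key names follow the evaluation report fields used later in this notebook, and the example scores are illustrative only.

```python
# Weights from the list above; the other two metrics carry no weight.
WEIGHTS = {
    "data_quality_score": 0.30,
    "sentiment_balance_score": 0.15,
    "topic_quality_score": 0.25,
    "summarization_score": 0.30,
}


def overall_score(component_scores: dict) -> float:
    """Weighted sum of component scores on a 0-100 scale (assumed formula)."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Illustrative component scores, not the project's results.
example = {
    "data_quality_score": 70,
    "sentiment_balance_score": 60,
    "topic_quality_score": 100,
    "summarization_score": 90,
}
print(f"Overall: {overall_score(example):.1f}/100")
```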

In [None]:
# Load evaluation report
import json

with open('data/evaluation/evaluation_report.json', 'r') as f:
    eval_data = json.load(f)

print("=== PIPELINE EVALUATION REPORT ===")
print(f"\nOverall Score: {eval_data['overall_score']:.1f}/100")
print(f"Rating: {eval_data['rating']}")
print(f"\nComponent Scores:")
for key, value in eval_data.items():
    if 'score' in key and key != 'overall_score':
        component = key.replace('_score', '').replace('_', ' ').title()
        print(f"  {component:25} : {value:.1f}/100")

In [None]:
# Visualize evaluation scores
scores = {
    'Data Quality': eval_data.get('data_quality_score', 0),
    'Sentiment Balance': eval_data.get('sentiment_balance_score', 0),
    'Topic Quality': eval_data.get('topic_quality_score', 0),
    'Summarization': eval_data.get('summarization_score', 0)
}

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.barh(list(scores.keys()), list(scores.values()), color='skyblue')

# Color code bars
for i, (name, score) in enumerate(scores.items()):
    if score >= 80:
        bars[i].set_color('#2ecc71')  # Green
    elif score >= 60:
        bars[i].set_color('#f39c12')  # Orange
    else:
        bars[i].set_color('#e74c3c')  # Red
    
    # Add score labels
    ax.text(score + 2, i, f'{score:.0f}/100', va='center')

ax.set_xlim(0, 110)
ax.set_xlabel('Score')
ax.set_title('Pipeline Component Scores', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

### 5.2 Model Performance Summary

In [None]:
# Create performance summary table
performance_data = {
    'Task': ['Sentiment Analysis', 'Topic Modeling', 'Summarization'],
    'Model': ['FinBERT', 'BERTopic', 'DistilBART'],
    'Primary Metric': [
        f"{df['sentiment_confidence'].mean():.1%} confidence",
        f"{df['topic_label'].nunique()} topics discovered",
        f"{df['compression_ratio'].mean():.1f}x compression"
    ],
    'Quality Score': [
        f"{eval_data.get('sentiment_balance_score', 0):.0f}/100",
        f"{eval_data.get('topic_quality_score', 0):.0f}/100",
        f"{eval_data.get('summarization_score', 0):.0f}/100"
    ]
}

perf_df = pd.DataFrame(performance_data)
print("=== MODEL PERFORMANCE SUMMARY ===")
print(perf_df.to_string(index=False))

## 6. Results & Business Insights

### 6.1 Key Findings

In [None]:
print("=== KEY FINDINGS ===")
print(f"\n1. Sentiment Trends:")
for label in ['negative', 'neutral', 'positive']:
    n = (df['sentiment'] == label).sum()
    print(f"   - {label.capitalize()} articles: {n} ({n / len(df):.1%})")
print(f"   → Realistic distribution for economic news")

print(f"\n2. Dominant Topics:")
top_topics = df['topic_label'].value_counts().head(3)
for topic, count in top_topics.items():
    pct = count / len(df) * 100
    print(f"   - {topic}: {count} articles ({pct:.0f}%)")

print(f"\n3. Model Performance:")
print(f"   - FinBERT confidence: {df['sentiment_confidence'].mean():.1%} (reliable)")
print(f"   - Topic quality score: {eval_data.get('topic_quality_score', 0):.0f}/100")
print(f"   - Summarization: {df['compression_ratio'].mean():.1f}x compression")

print(f"\n4. Data Quality:")
print(f"   - Retention rate: 78.6% (55/70 articles)")
print(f"   - Avg article length: {df['text'].str.split().str.len().mean():.0f} words")
print(f"   - Sources: {df['source'].nunique()} unique news outlets")

### 6.2 Sentiment Over Time Analysis

In [None]:
# Sentiment over time
df['date'] = pd.to_datetime(df['publishedAt']).dt.date
sentiment_time = df.groupby(['date', 'sentiment']).size().reset_index(name='count')

# Pivot for easier plotting
sentiment_pivot = sentiment_time.pivot(index='date', columns='sentiment', values='count').fillna(0)

# Plot
fig, ax = plt.subplots(figsize=(14, 6))
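# Map colors by column name so they stay correct if a sentiment class is absent
sentiment_colors = {'negative': '#e74c3c', 'neutral': '#95a5a6', 'positive': '#2ecc71'}
sentiment_pivot.plot(kind='line', marker='o', ax=ax,
                     color=[sentiment_colors[c] for c in sentiment_pivot.columns])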
ax.set_title('Sentiment Over Time', fontsize=14, fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel('Number of Articles')
ax.legend(title='Sentiment')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("=== TEMPORAL ANALYSIS ===")
print(f"\nDaily sentiment breakdown:")
print(sentiment_pivot)

### 6.3 Business Recommendations

In [None]:
print("=== BUSINESS RECOMMENDATIONS ===")
print(f"\n📈 For Investors:")
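top_share = top_topics.iloc[0] / len(df)
neg_baseline = (df['sentiment'] == 'negative').mean()
print(f"   • Monitor '{top_topics.index[0]}' ({top_share:.0%} of news coverage)")
print(f"   • Track negative sentiment spikes (current baseline: {neg_baseline:.0%})")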
print(f"   • Use AI summaries for quick daily scanning")

print(f"\n📰 For Media Organizations:")
print(f"   • Focus on emerging topics: {', '.join(top_topics.index[1:3])}")
print(f"   • Identify underreported areas (topics with < 10% coverage)")
print(f"   • Optimize content strategy based on trending narratives")

print(f"\n📊 For Financial Analysts:")
print(f"   • Automate morning briefing (2x daily updates)")
print(f"   • Set alerts for sentiment changes > 20%")
print(f"   • Use topic clustering for sector-specific research")

print(f"\n🎯 System Improvements:")
print(f"   • Extend historical data (current: 2 days → target: 30+ days)")
print(f"   • Add multi-language support (current: English only)")
print(f"   • Implement real-time alerting for breaking news")
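
The alerting recommendation above can be sketched as a simple comparison between two batches of predictions. This sketch reads "sentiment changes > 20%" as a shift of more than 20 percentage points in the share of negative articles, which is an assumption. `negative_share` and `should_alert` are hypothetical helpers and the data is made up.

```python
def negative_share(labels):
    """Fraction of articles labeled 'negative'."""
    return sum(1 for label in labels if label == "negative") / len(labels)


def should_alert(previous_labels, current_labels, threshold=0.20):
    """True when the negative share moves by more than `threshold` (absolute)."""
    change = negative_share(current_labels) - negative_share(previous_labels)
    return abs(change) > threshold


yesterday = ["neutral"] * 8 + ["negative"] * 2  # 20% negative
today = ["neutral"] * 5 + ["negative"] * 5      # 50% negative
print(should_alert(yesterday, today))
```

With only a few dozen articles per batch, a single article moves the share by several points, so the threshold may need tuning once more history is available.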

## 7. Deployment

### 7.1 Production Architecture

**Infrastructure:**
- **Dashboard**: Railway (Streamlit app)
- **Automation**: GitHub Actions (cron: 8:00 & 20:00 UTC)
- **Storage**: CSV files in Git repo
- **Models**: HuggingFace Transformers (CPU inference)
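
The schedule above corresponds to a cron expression such as `0 8,20 * * *`, evaluated in UTC. As a small illustration, the sketch below computes when the next run is due, which a dashboard could show as a "next update" hint. `next_run` is a hypothetical helper and not part of the project.

```python
from datetime import datetime, time, timedelta, timezone

RUN_TIMES_UTC = (time(8, 0), time(20, 0))  # schedule stated above


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now`, in UTC."""
    now = now.astimezone(timezone.utc)
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).date()
        for run_time in RUN_TIMES_UTC:
            candidate = datetime.combine(day, run_time, tzinfo=timezone.utc)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: a run always exists within two days")


print(next_run(datetime(2025, 11, 15, 9, 30, tzinfo=timezone.utc)))
```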

**Data Flow:**
```
NewsData.io API → GitHub Actions (2x daily)
       ↓
Pipeline Processing (7 steps)
       ↓
CSV files (data/processed/)
       ↓
Auto-commit to GitHub
       ↓
Railway deploys dashboard
```

### 7.2 Live Dashboard

**URL**: https://newstrendanalysis.up.railway.app/

**Features:**
- 📊 4 key metrics (total, unique, sentiment, topics)
- 📈 3 interactive charts (Plotly)
- 🔍 Filters (sentiment, topic, date)
- 📄 Paginated article list with summaries
- 🟢🟡🔴 Confidence badges
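
One way the confidence badges could be assigned is shown below. The cut-offs reuse the 0.7 and 0.8 levels from the confidence analysis in section 4.2, but whether the dashboard uses exactly these thresholds is an assumption. `confidence_badge` is a hypothetical helper.

```python
def confidence_badge(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a traffic-light badge (assumed cut-offs)."""
    if confidence >= 0.8:
        return "🟢"  # high confidence
    if confidence >= 0.7:
        return "🟡"  # moderate confidence
    return "🔴"      # low confidence


for score in (0.92, 0.75, 0.55):
    print(f"{score:.0%} {confidence_badge(score)}")
```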

## 8. Conclusion

### 8.1 Project Summary

Successfully built a **production-ready NLP pipeline** that:

✅ **Solves Business Problem**: Automates economic news sentiment tracking  
✅ **High Quality**: 85/100 overall score with comprehensive evaluation  
✅ **Multiple Tasks**: 3 NLP tasks (sentiment, topics, summarization)  
✅ **Automated**: 2x daily updates via GitHub Actions  
✅ **Interactive**: Live Streamlit dashboard with visualizations  
✅ **Well-Documented**: Technical notebook, README, evaluation reports  

### 8.2 Technical Achievements

**Models:**
- FinBERT: 76% confidence, realistic sentiment distribution
- BERTopic: 6 coherent topics, 100/100 quality score
- DistilBART: 100% coverage, 37.7x compression

**Data Quality:**
- 78.6% retention rate (effective filtering)
- 40+ unique sources (diverse coverage)
- Automated deduplication

### 8.3 Limitations

⚠️ **Current Limitations:**
- Short time period (2 days of data)
- English language only
- Economic news domain specific
- No predictive capabilities (descriptive only)

### 8.4 Future Work

🚀 **Potential Enhancements:**
1. **Extended Historical Data**: 30+ days for trend analysis
2. **Multi-language Support**: Add Croatian, German, etc.
3. **Predictive Models**: Forecast sentiment trends
4. **Real-time Alerts**: Notify on significant changes
5. **Sector-specific Analysis**: Finance, tech, healthcare
6. **Entity Recognition**: Track companies, people, locations

### 8.5 Academic Contribution

This project demonstrates:
- ✅ Transfer learning (leveraging pre-trained models)
- ✅ Model comparison (Twitter-RoBERTa → FinBERT)
- ✅ Unsupervised learning (BERTopic clustering)
- ✅ Comprehensive evaluation (6 metrics)
- ✅ Production deployment (CI/CD with GitHub Actions)
- ✅ Interactive visualization (Streamlit dashboard)

---

**Thank you for reading this technical documentation!**

**Live Demo**: https://newstrendanalysis.up.railway.app/  
**GitHub**: https://github.com/davidjosipovic/news-trend-analysis  