# Financial News NLP — Final Summary & Takeaways

This notebook synthesizes results from the full NLP pipeline:
- exploratory analysis
- text cleaning
- clustering
- topic modeling

The goal is to communicate what we learned, why it matters,
and how this work could be extended.

## Project Motivation

Financial markets are influenced not only by prices and fundamentals,
but also by narratives conveyed through news and media.

This project asks:
**Can natural language processing uncover recurring market narratives
from financial news headlines, and do different news sources emphasize
different themes?**

Rather than predicting prices, the focus is on extracting
interpretable insights from market language.

## Data Overview

We analyzed financial news headlines from three major outlets:
- **CNBC** — market commentary and stock-focused reporting
- **Reuters** — global, fact-driven financial journalism
- **The Guardian** — broader economic and political framing

Each dataset consists of headline text and publication metadata.
After cleaning and deduplication, the final dataset contained
tens of thousands of unique headlines.
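
As a rough illustration of the cleaning and deduplication step, the sketch below lowercases headlines, strips URLs, normalizes curly quotes, and drops exact duplicates after cleaning. The function names `clean_headline` and `deduplicate`, the regular expressions, and the sample headlines are assumptions made for this example; the project's actual cleaning rules may differ.

```python
import re

# Hypothetical patterns for this sketch; the project's real rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_RE = re.compile(r"\s+")
# Map curly quotes to straight quotes so punctuation is consistent.
QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_headline(text: str) -> str:
    """Apply conservative normalization: URLs out, lowercase, tidy whitespace."""
    text = URL_RE.sub(" ", text)
    text = text.translate(QUOTE_TABLE).lower()
    return WHITESPACE_RE.sub(" ", text).strip()


def deduplicate(headlines):
    """Clean each headline and keep the first occurrence of each cleaned string."""
    seen = set()
    unique = []
    for raw in headlines:
        cleaned = clean_headline(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


# Made-up sample headlines, for illustration only.
sample = [
    "Fed raises rates  https://example.com/a",
    "fed raises rates",
    "Stocks rally on earnings",
]
unique_headlines = deduplicate(sample)
print(unique_headlines)
```

Deduplicating after cleaning, rather than before, means headlines that differ only in casing or a trailing link collapse into one entry.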

## Methodology Overview

The analysis followed a structured NLP pipeline:

1. **Exploratory Data Analysis**
   - inspected schema and source balance
   - identified headline text as the core NLP input

2. **Text Cleaning & Normalization**
   - standardized casing and punctuation
   - removed noise (URLs, duplicates)
   - produced a reusable cleaned dataset

3. **Baseline Clustering (TF-IDF + KMeans)**
   - converted text to interpretable numerical features
   - identified coarse thematic groupings
   - compared themes across sources

4. **Topic Modeling (LDA)**
   - extracted dominant narrative themes
   - validated topics using real headline examples
   - measured narrative emphasis by source

## Key Findings

Several clear and interpretable market narratives emerged:

- **Macro Economy & COVID Impact**
  Headlines focused on economic growth, layoffs, and global disruption,
  particularly during periods of uncertainty.

- **Crisis, Brexit & Financial Stability**
  UK- and Europe-focused reporting emphasized institutional stress,
  banking exposure, and policy uncertainty.

- **Market Commentary & Stock Picks**
  CNBC headlines clustered around analyst opinions, earnings reactions,
  and stock-level recommendations.

- **Big Tech, IPOs & Corporate Deals**
  Large-cap technology firms, IPO activity, and mergers formed a
  distinct and consistent narrative.

Topic distributions differed meaningfully across news sources,
reflecting differences in editorial focus and framing.
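
One way to measure narrative emphasis by source is to average each headline's LDA topic proportions within each outlet. The sketch below shows the idea with scikit-learn and NumPy. The headlines, their source labels, and the two-topic setting are made up for illustration, and the variable name `topic_share_by_source` is hypothetical; this is not necessarily how the project computed its comparison.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines and source labels, for illustration only.
headlines = [
    "analyst upgrades stock after earnings beat",
    "top stock picks as earnings season begins",
    "analyst says buy this stock before earnings",
    "central bank warns on global growth outlook",
    "global economy slows as layoffs rise",
    "growth outlook cut amid global layoffs",
    "brexit uncertainty weighs on uk banks",
    "uk banks face stress over brexit policy",
    "policy uncertainty hits uk growth outlook",
]
sources = ["CNBC"] * 3 + ["Reuters"] * 3 + ["The Guardian"] * 3

# LDA works on raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(headlines)

# n_components=2 is an assumption for this toy example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # one row per headline, rows sum to 1

# Average the per-headline topic proportions within each source.
source_array = np.array(sources)
topic_share_by_source = {
    source: doc_topic[source_array == source].mean(axis=0)
    for source in sorted(set(sources))
}
for source, share in topic_share_by_source.items():
    print(source, np.round(share, 2))
```

Each source's shares sum to 1, so the rows can be compared directly as the fraction of that outlet's headline mass assigned to each topic.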

## What Worked Well

- TF-IDF provided a strong, interpretable baseline for clustering
- Topic modeling produced human-readable narratives despite short text
- Source comparisons revealed meaningful framing differences
- Conservative text cleaning preserved semantic meaning

## Limitations

- Headlines are short and lack full article context
- Some topics are broad or overlap due to shared financial vocabulary
- Topic count selection is heuristic rather than definitive
- No direct linkage to market outcomes or asset prices

These limitations are expected for headline-level NLP
and are documented intentionally.
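
To make the topic-count limitation concrete, the sketch below compares held-out perplexity across a few candidate topic counts. This is one common heuristic and is an assumption here: the notebook does not state which criterion was used, and the headlines, the candidate values, and the name `perplexity_by_k` are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines, for illustration only.
train = [
    "fed signals rate hike as inflation climbs",
    "central bank holds rates amid inflation fears",
    "tech giant announces ipo plans",
    "startup ipo values tech firm at billions",
    "brexit uncertainty weighs on uk banks",
    "uk banks face stress over brexit policy",
    "analyst upgrades stock after earnings beat",
    "top stock picks as earnings season begins",
]
held_out = [
    "inflation fears push fed toward rate hike",
    "tech ipo boom lifts stock picks",
]

# Fit the vocabulary on training text only, then reuse it for held-out text.
vectorizer = CountVectorizer(stop_words="english")
train_counts = vectorizer.fit_transform(train)
held_out_counts = vectorizer.transform(held_out)

# Fit one model per candidate topic count and score it on held-out headlines.
perplexity_by_k = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(train_counts)
    perplexity_by_k[k] = lda.perplexity(held_out_counts)

for k, value in perplexity_by_k.items():
    print(k, round(value, 1))
```

Lower perplexity suggests a better statistical fit, but it does not guarantee more interpretable topics, so a score like this is best read alongside manual inspection of example headlines.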

## Future Work

Possible extensions include:
- analyzing full article text instead of headlines
- applying embedding-based topic models (e.g., BERTopic)
- aligning narratives with market volatility or returns
- building a lightweight dashboard for narrative exploration

These were intentionally scoped out of the current project
to maintain analytical clarity.

## Final Takeaway

This project demonstrates how NLP can be used to extract
interpretable market narratives from financial news.

By focusing on clarity, validation, and communication,
the analysis provides insights into how financial stories
are framed across major media outlets — without overclaiming
predictive power.