# Financial News and Stock Price Integration Dataset - EDA
## Nova Financial Solutions

This notebook performs comprehensive Exploratory Data Analysis on the Financial News dataset.

**Dataset Requirements:**
- `headline`: The financial news headline
- `url`: Link to the full article
- `publisher`: Author or news source
- `date`: Publication date and time (UTC-4 timezone)
- `stock`: Stock ticker symbol (e.g., AAPL)


In [None]:
import sys
from pathlib import Path

# Add src to path
project_root = Path().resolve().parent
src_path = str(project_root / 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import all analysis modules
from data_loader import DataLoader
from eda_analyzer import EDAAnalyzer
from topic_modeling import TopicModeler
from publisher_analyzer import PublisherAnalyzer

# Store in globals for easy access in other cells
globals()['DataLoader'] = DataLoader
globals()['EDAAnalyzer'] = EDAAnalyzer
globals()['TopicModeler'] = TopicModeler
globals()['PublisherAnalyzer'] = PublisherAnalyzer
globals()['project_root'] = project_root

print("Libraries imported successfully!")
print(f"Project root: {project_root}")
print("All modules are now available in this notebook session.")


Libraries imported successfully!
Project root: C:\Users\HomePC\Desktop\Second\Predicting-Price-Moves-with-News-Sentiment


## 1. Load and Preprocess Data


In [None]:
# Update this path to your dataset location
# You can use a relative path from the project root or an absolute path
# Default: looks for file in the project's data/ folder
DATA_PATH = "data/raw_analyst_ratings.csv"

# Ensure imports and project_root are available (works even if cells are run out of order)
import sys
from pathlib import Path

# Define project_root (notebook is in notebooks/ folder, so parent is project root)
project_root = Path().resolve().parent
print(f"Project root: {project_root}")

# Add src to path if not already added
src_path = str(project_root / 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)
    print(f"Added to sys.path: {src_path}")
else:
    print(f"Path already in sys.path: {src_path}")

# Verify src directory exists
src_dir = project_root / 'src'
data_loader_file = src_dir / 'data_loader.py'
print(f"Looking for data_loader.py at: {data_loader_file}")

if not src_dir.exists():
    raise FileNotFoundError(f"Source directory not found at {src_dir}. Please ensure you're running from the correct location.")

if not data_loader_file.exists():
    raise FileNotFoundError(f"data_loader.py not found at {data_loader_file}. Please ensure the file exists.")

# Import DataLoader
print("Importing DataLoader...")
try:
    from data_loader import DataLoader
    print("✓ DataLoader imported successfully!")
except ImportError as e:
    print(f"❌ Import failed. sys.path: {sys.path[:3]}...")  # Show first 3 paths
    raise ImportError(
        f"Failed to import DataLoader from {src_path}. "
        f"Please ensure the src/data_loader.py file exists. "
        f"Error: {e}"
    )

# Convert to absolute path for better reliability
# Handle both relative paths (from project root) and absolute paths
if Path(DATA_PATH).is_absolute():
    data_path_abs = Path(DATA_PATH)
else:
    data_path_abs = (project_root / DATA_PATH).resolve()

# Check if file exists
print("=" * 80)
print("DATASET LOADING")
print("=" * 80)
print(f"Looking for dataset at: {data_path_abs}")

if not data_path_abs.exists():
    print("\n❌ ERROR: Dataset file not found!")
    print(f"\nLooking for file at: {data_path_abs}")
    print(f"\nPlease do ONE of the following:")
    print(f"\n1. Place your dataset CSV file at:")
    print(f"   {data_path_abs}")
    print(f"\n2. Or update the DATA_PATH variable above to point to your dataset.")
    print(f"   For example:")
    print(f"   DATA_PATH = 'data/your_file.csv'  # relative to project root")
    print(f"   DATA_PATH = 'C:/path/to/your/file.csv'  # absolute path")
    
    # Check if data directory exists
    data_dir = project_root / "data"
    if data_dir.exists():
        print(f"\n✓ Found data directory at: {data_dir}")
        print(f"  Files in data directory:")
        try:
            files = list(data_dir.glob("*.csv"))
            if files:
                for f in files[:5]:  # Show first 5 CSV files
                    print(f"    - {f.name}")
                if len(files) > 5:
                    print(f"    ... and {len(files) - 5} more CSV files")
            else:
                print(f"    (no CSV files found)")
        except:
            pass
    else:
        print(f"\n⚠ Data directory not found at: {data_dir}")
        print(f"  You may need to create it first.")
    
    print(f"\nExpected columns in the dataset:")
    print("  - headline: The financial news headline")
    print("  - url: Link to the full article")
    print("  - publisher: Author or news source")
    print("  - date: Publication date and time (UTC-4 timezone)")
    print("  - stock: Stock ticker symbol (e.g., AAPL)")
    
    raise FileNotFoundError(f"Dataset not found at {data_path_abs}\nPlease place your dataset CSV file at the specified location or update DATA_PATH.")

# Load data
print("\n✓ Dataset file found!")
print("Loading dataset...")
loader = DataLoader(str(data_path_abs))
df = loader.load_data()
df = loader.preprocess_data()

print(f"\n✓ Loaded {len(df):,} articles")
print(f"✓ Date range: {df['date'].min()} to {df['date'].max()}")
print(f"✓ Unique publishers: {df['publisher'].nunique()}")
print(f"✓ Unique stocks: {df['stock'].nunique()}")

# Display first few rows
print("\nFirst 5 rows:")
df.head()


DATASET LOADING
Looking for dataset at: C:\Users\HomePC\Desktop\Second\Predicting-Price-Moves-with-News-Sentiment\data\raw_analyst_ratings.csv

✓ Dataset file found!
Loading dataset...
Removed 1351341 rows with missing critical data

✓ Loaded 55,987 articles
✓ Date range: 2011-04-28 01:01:48+00:00 to 2020-06-11 21:12:35+00:00
✓ Unique publishers: 225
✓ Unique stocks: 6204

First 5 rows:


Unnamed: 0.1,Unnamed: 0,headline,url,publisher,date,stock,year,month,day,day_of_week,hour,date_only,headline_length,headline_word_count,publisher_domain
0,0,Stocks That Hit 52-Week Highs On Friday,https://www.benzinga.com/news/20/06/16190091/s...,Benzinga Insights,2020-06-05 14:30:54+00:00,A,2020.0,6.0,5.0,Friday,14.0,2020-06-05,39,7,Benzinga Insights
1,1,Stocks That Hit 52-Week Highs On Wednesday,https://www.benzinga.com/news/20/06/16170189/s...,Benzinga Insights,2020-06-03 14:45:20+00:00,A,2020.0,6.0,3.0,Wednesday,14.0,2020-06-03,42,7,Benzinga Insights
2,2,71 Biggest Movers From Friday,https://www.benzinga.com/news/20/05/16103463/7...,Lisa Levin,2020-05-26 08:30:07+00:00,A,2020.0,5.0,26.0,Tuesday,8.0,2020-05-26,29,5,Lisa Levin
3,3,46 Stocks Moving In Friday's Mid-Day Session,https://www.benzinga.com/news/20/05/16095921/4...,Lisa Levin,2020-05-22 16:45:06+00:00,A,2020.0,5.0,22.0,Friday,16.0,2020-05-22,44,7,Lisa Levin
4,4,B of A Securities Maintains Neutral on Agilent...,https://www.benzinga.com/news/20/05/16095304/b...,Vick Meyer,2020-05-22 15:38:59+00:00,A,2020.0,5.0,22.0,Friday,15.0,2020-05-22,87,14,Vick Meyer


## 2. Descriptive Statistics


In [None]:
# Ensure imports are available (works even if cells are run out of order)
try:
    from eda_analyzer import EDAAnalyzer
except ImportError:
    import sys
    from pathlib import Path
    project_root = Path().resolve().parent
    src_path = str(project_root / 'src')
    if src_path not in sys.path:
        sys.path.insert(0, src_path)
    from eda_analyzer import EDAAnalyzer

# Initialize EDA analyzer
eda = EDAAnalyzer(df, "../output")
stats = eda.compute_descriptive_stats()

print("=" * 80)
print("DESCRIPTIVE STATISTICS")
print("=" * 80)
print(f"\nDataset Overview:")
print(f"  Total Articles: {stats['total_articles']:,}")
print(f"  Unique Publishers: {stats['unique_publishers']}")
print(f"  Unique Stocks: {stats['unique_stocks']}")
print(f"  Date Range: {stats['date_range']['start']} to {stats['date_range']['end']}")

print(f"\nHeadline Length Statistics (Characters):")
print(f"  Min: {stats['headline_length']['min']}")
print(f"  Max: {stats['headline_length']['max']}")
print(f"  Mean: {stats['headline_length']['mean']:.2f}")
print(f"  Median: {stats['headline_length']['median']:.2f}")
print(f"  Std Dev: {stats['headline_length']['std']:.2f}")
print(f"  Q25: {stats['headline_length']['q25']:.2f}")
print(f"  Q75: {stats['headline_length']['q75']:.2f}")

print(f"\nHeadline Word Count Statistics:")
print(f"  Min: {stats['headline_word_count']['min']}")
print(f"  Max: {stats['headline_word_count']['max']}")
print(f"  Mean: {stats['headline_word_count']['mean']:.2f}")
print(f"  Median: {stats['headline_word_count']['median']:.2f}")
print(f"  Std Dev: {stats['headline_word_count']['std']:.2f}")


DESCRIPTIVE STATISTICS

Dataset Overview:
  Total Articles: 55,987
  Unique Publishers: 225
  Unique Stocks: 6204
  Date Range: 2011-04-28 01:01:48+00:00 to 2020-06-11 21:12:35+00:00

Headline Length Statistics (Characters):
  Min: 12
  Max: 512
  Mean: 80.02
  Median: 63.00
  Std Dev: 56.13
  Q25: 42.00
  Q75: 91.00

Headline Word Count Statistics:
  Min: 2
  Max: 77
  Mean: 12.44
  Median: 10.00
  Std Dev: 8.46


In [None]:
# Ensure matplotlib is imported
import matplotlib.pyplot as plt

# Visualize headline length distribution
print("Generating headline length distribution visualizations...")
eda.plot_headline_length_distribution(save=False)
plt.show()


Generating headline length distribution visualizations...


## 3. Top Publishers Analysis


In [12]:
# Analyze top publishers
top_publishers = eda.analyze_top_publishers(10)

print("=" * 80)
print("TOP 10 PUBLISHERS BY ARTICLE COUNT")
print("=" * 80)
print(top_publishers.to_string(index=False))


TOP 10 PUBLISHERS BY ARTICLE COUNT
        publisher  article_count  unique_stocks             first_article              last_article
Benzinga Newsdesk          14750           3771 2016-06-23 12:56:09+00:00 2020-06-11 21:11:20+00:00
       Lisa Levin          12408           3646 2011-06-07 13:49:32+00:00 2020-06-11 16:19:17+00:00
    ETF Professor           4362           1010 2011-04-28 01:01:48+00:00 2020-06-11 19:25:36+00:00
    Paul Quintaro           4212           1242 2011-05-17 14:06:30+00:00 2018-05-31 19:49:22+00:00
Benzinga Newsdesk           3177            956 2020-05-12 16:00:37+00:00 2020-06-11 18:26:26+00:00
Benzinga Insights           2332           1202 2020-03-24 15:35:03+00:00 2020-06-11 20:24:41+00:00
       Vick Meyer           2128           1228 2018-02-06 13:46:03+00:00 2020-06-03 14:08:51+00:00
    Charles Gross           1790           1023 2011-11-30 12:34:52+00:00 2020-06-11 15:08:26+00:00
       Hal Lindon           1470            844 2013-06-17 12:15:

In [None]:
# Visualize top publishers
eda.plot_top_publishers(10, save=False)
plt.show()


## 4. Publication Frequency Analysis


In [None]:
# Analyze publication frequency
freq_stats = eda.analyze_publication_frequency()

print("=" * 80)
print("PUBLICATION FREQUENCY INSIGHTS")
print("=" * 80)

# Most active day
max_day = max(freq_stats['by_day'].items(), key=lambda x: x[1])
print(f"\nMost Active Day of Week: {max_day[0]} ({max_day[1]} articles)")

# Most active month
max_month = max(freq_stats['by_month'].items(), key=lambda x: x[1])
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
print(f"Most Active Month: {month_names[max_month[0]-1]} ({max_month[1]} articles)")

# Peak hour
max_hour = max(freq_stats['by_hour'].items(), key=lambda x: x[1])
print(f"Peak Hour: {max_hour[0]:02d}:00 ({max_hour[1]} articles)")

# Visualize
eda.plot_publication_frequency(save=False)
plt.show()


PUBLICATION FREQUENCY INSIGHTS

Most Active Day of Week: Thursday (12712 articles)
Most Active Month: May (11364 articles)
Peak Hour: 14:00 (7669 articles)


### 4.1 News Activity Spikes Detection


In [None]:
# Detect news spikes (articles > 2 standard deviations above mean)
spikes = eda.detect_news_spikes(threshold_std=2.0)

print("=" * 80)
print(f"NEWS ACTIVITY SPIKES (>2σ above mean)")
print("=" * 80)
print(f"Total Spikes Detected: {len(spikes)}")

if len(spikes) > 0:
    print("\nTop 10 Spikes:")
    print(spikes.head(10).to_string(index=False))
    
    # Visualize spikes
    eda.plot_news_spikes(threshold_std=2.0, save=False)
    plt.show()
else:
    print("No significant spikes detected.")


NEWS ACTIVITY SPIKES (>2σ above mean)
Total Spikes Detected: 51

Top 10 Spikes:
      date  article_count  deviation_from_mean  std_deviation
2020-03-12            973           950.623102      13.883485
2020-06-05            932           909.623102      13.284696
2020-06-10            807           784.623102      11.459119
2020-06-09            803           780.623102      11.400700
2020-06-08            765           742.623102      10.845725
2020-05-07            751           728.623102      10.641260
2020-06-03            720           697.623102      10.188517
2020-03-19            630           607.623102       8.874102
2020-05-26            628           605.623102       8.844893
2020-05-13            549           526.623102       7.691128


## 5. Topic Modeling - Frequent Keywords


In [None]:
# Initialize topic modeler
topic_modeler = TopicModeler(df, "../output")

# Extract frequent keywords
print("Extracting frequent keywords...")
keywords_df = topic_modeler.extract_frequent_keywords(50)

print("=" * 80)
print("TOP 50 FREQUENT KEYWORDS")
print("=" * 80)
print(keywords_df.head(20).to_string(index=False))


Extracting frequent keywords...
Preprocessing headlines...
Creating dictionary...
Creating corpus...
Corpus prepared: 55987 documents, 6633 unique terms
TOP 50 FREQUENT KEYWORDS
  keyword  frequency  percentage
     week       9090    2.211878
      hit       5925    1.441736
      low       5660    1.377253
      eps       5530    1.345620
   target       4695    1.142439
  several       4650    1.131489
     sale       4616    1.123215
    lower       4503    1.095719
      etf       4497    1.094259
   higher       4269    1.038780
 estimate       4090    0.995223
  session       3486    0.848252
maintains       3266    0.794719
     high       3090    0.751893
yesterday       3068    0.746539
 thursday       2927    0.712230
   moving       2867    0.697630
     amid       2672    0.650180
   friday       2634    0.640934
following       2525    0.614411


In [None]:
# Visualize frequent keywords
topic_modeler.plot_frequent_keywords(30, save=False)
plt.show()


## 6. Topic Modeling - LDA (Latent Dirichlet Allocation)


In [None]:
# Train LDA model
print("=" * 80)
print("TRAINING LDA MODEL")
print("=" * 80)
print("This may take a few minutes...")

lda_model = topic_modeler.train_lda(num_topics=10, passes=10)
lda_topics = topic_modeler.get_lda_topics(num_words=10)

print("\n✓ LDA Model Trained Successfully!")
print("\nExtracted Topics:")
print("=" * 80)
for topic_name, topic_data in lda_topics.items():
    print(f"\n{topic_name}:")
    print(f"  Top words: {', '.join(topic_data['top_words'][:10])}")


TRAINING LDA MODEL
This may take a few minutes...
Training LDA model with 10 topics...
LDA model trained successfully!

✓ LDA Model Trained Successfully!

Extracted Topics:

Topic_0:
  Top words: several" , higher" , reopening" , optimism" , following" , amid" , economy" , economic" , equity" , strong

Topic_1:
  Top words: etf" , set" , week" , low" , yesterday" , dividend" , october" , watch" , september" , nov

Topic_2:
  Top words: target" , maintains" , raise" , lower" , buy" , downgrade" , neutral" , morgan" , outperform" , announces

Topic_3:
  Top words: week" , hit" , eps" , sale" , estimate" , low" , high" , yoy" , thursday" , beat

Topic_4:
  Top words: several" , higher" , lower" , amid" , coronavirus" , economic" , following" , state" , sector" , oil

Topic_5:
  Top words: biggest" , mover" , yesterday" , update" , friday" , midafternoon" , midday" , merger" , spike" , acquisition

Topic_6:
  Top words: earnings" , update" , lower" , second" , higher" , wave" , case" , mid

In [None]:
# Visualize LDA topics
topic_modeler.plot_lda_topics(num_words=10, save=False)
plt.show()


In [None]:
# Create interactive LDA visualization (optional)
try:
    print("Creating interactive LDA visualization...")
    vis = topic_modeler.create_lda_visualization(save=True)
    print("✓ Interactive visualization saved to output/lda_visualization.html")
except Exception as e:
    print(f"⚠ Could not create interactive visualization: {e}")


Creating interactive LDA visualization...
Creating LDA visualization...
⚠ Could not create interactive visualization: [Errno 2] No such file or directory: '../output/lda_visualization.html'


## 7. Topic Modeling - BERTopic (Optional, Advanced)


In [None]:
# Train BERTopic model (optional - may take longer)
print("=" * 80)
print("TRAINING BERTOPIC MODEL (Optional)")
print("=" * 80)
print("This may take several minutes...")

try:
    bertopic_model = topic_modeler.train_bertopic(min_topic_size=10)
    
    if bertopic_model:
        bertopic_topics = topic_modeler.get_bertopic_topics()
        print("\n✓ BERTopic Model Trained Successfully!")
        print(f"✓ Found {len(bertopic_topics)} topics")
        
        # Visualize
        topic_modeler.plot_bertopic_topics(save=True)
        print("✓ BERTopic visualizations saved")
        
        # Show sample topics
        print("\nSample BERTopic Topics:")
        for i, (topic_name, topic_data) in enumerate(list(bertopic_topics.items())[:5]):
            print(f"\n{topic_name}:")
            print(f"  Count: {topic_data['count']}")
            print(f"  Top words: {', '.join(topic_data['words'][:10])}")
except Exception as e:
    print(f"⚠ BERTopic training failed or not available: {e}")
    print("Continuing with LDA results only...")


TRAINING BERTOPIC MODEL (Optional)
This may take several minutes...
Training BERTopic model...
This may take a while...


2025-11-23 19:14:22,137 - BERTopic - Embedding - Transforming documents to embeddings.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/1750 [00:00<?, ?it/s]

2025-11-23 19:28:35,755 - BERTopic - Embedding - Completed ✓
2025-11-23 19:28:35,761 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-23 19:32:39,189 - BERTopic - Dimensionality - Completed ✓
2025-11-23 19:32:39,275 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-23 19:32:55,194 - BERTopic - Cluster - Completed ✓
2025-11-23 19:32:55,357 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-23 19:33:00,488 - BERTopic - Representation - Completed ✓


BERTopic model trained successfully!

✓ BERTopic Model Trained Successfully!
✓ Found 996 topics
Error creating BERTopic visualizations: [Errno 2] No such file or directory: '..\\output\\bertopic_topics.html'
✓ BERTopic visualizations saved

Sample BERTopic Topics:

Topic_-1:
  Count: 12109
  Top words: reports, yoy, eps, vs, sales, est, deal, q2, cash, q3

Topic_0:
  Count: 1764
  Top words: thursday, lows, hit, that, 52week, stocks, on, , , 

Topic_1:
  Count: 882
  Top words: estimate, beat, beats, miss, misses, adj, sales, eps, q1, inline

Topic_2:
  Count: 716
  Top words: friday, lows, hit, that, 52week, stocks, on, breach, managed, to

Topic_3:
  Count: 540
  Top words: friday, highs, hit, that, 52week, stocks, on, , , 


## 8. Assign Topics to Articles


In [None]:
# Assign LDA topics to articles
df_with_topics = topic_modeler.assign_topics_to_articles(method='lda')

print("=" * 80)
print("TOPIC ASSIGNMENT SUMMARY")
print("=" * 80)
print(f"Total articles with topic assignments: {len(df_with_topics)}")
print("\nTopic distribution:")
print(df_with_topics['lda_topic'].value_counts().sort_index())

# Show sample articles with topics
print("\nSample articles with topic assignments:")
print(df_with_topics[['headline', 'stock', 'publisher', 'lda_topic']].head(10).to_string(index=False))


TOPIC ASSIGNMENT SUMMARY
Total articles with topic assignments: 55987

Topic distribution:
lda_topic
0     2222
1     7015
2     6610
3    11977
4     3562
5     5209
6     3532
7     4396
8     4726
9     6738
Name: count, dtype: int64

Sample articles with topic assignments:
                                                                                                                headline stock               publisher  lda_topic
                                                                                 Stocks That Hit 52-Week Highs On Friday     A       Benzinga Insights          3
                                                                              Stocks That Hit 52-Week Highs On Wednesday     A       Benzinga Insights          3
                                                                                           71 Biggest Movers From Friday     A              Lisa Levin          5
                                                                          

## 9. Publisher Analysis


In [None]:
# Initialize publisher analyzer
publisher_analyzer = PublisherAnalyzer(df, "../output")

# Rank publishers
rankings = publisher_analyzer.rank_publishers()

print("=" * 80)
print("PUBLISHER RANKINGS")
print("=" * 80)
print(rankings.head(15).to_string(index=False))


PUBLISHER RANKINGS
              publisher  article_count  unique_stocks  avg_headline_length  std_headline_length             first_article              last_article  rank
      Benzinga Newsdesk          14750           3771            97.485288            52.961824 2016-06-23 12:56:09+00:00 2020-06-11 21:11:20+00:00     1
             Lisa Levin          12408           3646            44.391441            16.313968 2011-06-07 13:49:32+00:00 2020-06-11 16:19:17+00:00     2
          ETF Professor           4362           1010            44.016735            10.909750 2011-04-28 01:01:48+00:00 2020-06-11 19:25:36+00:00     3
          Paul Quintaro           4212           1242            84.106600            36.620906 2011-05-17 14:06:30+00:00 2018-05-31 19:49:22+00:00     4
      Benzinga Newsdesk           3177            956           226.071766            66.098087 2020-05-12 16:00:37+00:00 2020-06-11 18:26:26+00:00     5
      Benzinga Insights           2332           1202    

In [None]:
# Visualize publisher analysis
publisher_analyzer.plot_publisher_analysis(top_n=10, save=False)
plt.show()


### 9.1 Publisher Domain Analysis


In [None]:
# Extract publisher domains
domains = publisher_analyzer.extract_publisher_domains()

if len(domains) > 0:
    print("=" * 80)
    print("PUBLISHER DOMAIN ANALYSIS")
    print("=" * 80)
    print(f"Total unique domains: {len(domains)}")
    print("\nTop domains by article count:")
    print(domains.head(10).to_string(index=False))
else:
    print("No email-like publisher values found. Using publisher names directly.")


PUBLISHER DOMAIN ANALYSIS
Total unique domains: 4

Top domains by article count:
                publisher       domain  article_count
  vishwanath@benzinga.com benzinga.com            924
        luke@benzinga.com benzinga.com            271
bret.kenwell@benzinga.com benzinga.com              1
vivek.proactive@gmail.com    gmail.com              3


### 9.2 Topic Preferences by Publisher


In [None]:
# Identify topic preferences by publisher
topic_categories = topic_modeler.identify_topic_categories()
topic_prefs = publisher_analyzer.identify_topic_preferences(topic_categories, top_n=10)

print("=" * 80)
print("TOPIC PREFERENCES BY PUBLISHER")
print("=" * 80)
print("Percentage of articles mentioning each topic category:")
print(topic_prefs.to_string(index=False))


TOPIC PREFERENCES BY PUBLISHER
Percentage of articles mentioning each topic category:
        publisher  total_articles  earnings_mentions  earnings_percentage  mergers_mentions  mergers_percentage  fda_approval_mentions  fda_approval_percentage  price_target_mentions  price_target_percentage  product_launch_mentions  product_launch_percentage  partnership_mentions  partnership_percentage  regulation_mentions  regulation_percentage  market_movement_mentions  market_movement_percentage
Benzinga Newsdesk           14750               4737            32.115254               748            5.071186                    787                 5.335593                   2201                14.922034                     1473                   9.986441                   864                5.857627                  661               4.481356                      3027                   20.522034
       Lisa Levin           12408                987             7.954545               147            1.1

In [None]:
# Visualize topic preferences
publisher_analyzer.plot_topic_preferences(topic_categories, top_n=10, save=False)
plt.show()


## 10. Save Results


In [1]:
# Create output directory
from pathlib import Path
output_dir = Path("../output")
output_dir.mkdir(parents=True, exist_ok=True)
(output_dir / "data").mkdir(parents=True, exist_ok=True)

# Save all results
print("=" * 80)
print("SAVING RESULTS")
print("=" * 80)

# Descriptive statistics
stats_df = pd.DataFrame([stats['descriptive']])
stats_df.to_csv(output_dir / "data" / "descriptive_statistics.csv", index=False)
print("✓ Descriptive statistics saved")

# Top publishers
top_publishers.to_csv(output_dir / "data" / "top_publishers.csv", index=False)
print("✓ Top publishers saved")

# Keywords
keywords_df.to_csv(output_dir / "data" / "frequent_keywords.csv", index=False)
print("✓ Frequent keywords saved")

# LDA topics
lda_topics_df = pd.DataFrame([
    {'topic': k, 'words': ', '.join(v['top_words'])} 
    for k, v in lda_topics.items()
])
lda_topics_df.to_csv(output_dir / "data" / "lda_topics.csv", index=False)
print("✓ LDA topics saved")

# Articles with topics
df_with_topics.to_csv(output_dir / "data" / "articles_with_topics.csv", index=False)
print("✓ Articles with topic assignments saved")

# Publisher rankings
rankings.to_csv(output_dir / "data" / "publisher_rankings.csv", index=False)
print("✓ Publisher rankings saved")

# Topic preferences
topic_prefs.to_csv(output_dir / "data" / "publisher_topic_preferences.csv", index=False)
print("✓ Publisher topic preferences saved")

# News spikes
spikes.to_csv(output_dir / "data" / "news_spikes.csv", index=False)
print("✓ News spikes saved")

# Summary report
report = eda.generate_summary_report()
with open(output_dir / "eda_summary_report.txt", 'w', encoding='utf-8') as f:
    f.write(report)
print("✓ Summary report saved")

print(f"\n✓ All results saved to: {output_dir.absolute()}")


SAVING RESULTS


NameError: name 'pd' is not defined

## 11. Summary and Recommendations


In [4]:
# Generate and display summary report
report = eda.generate_summary_report()
print(report)


NameError: name 'eda' is not defined

### Recommendations for Next Steps

**1. Sentiment Analysis Preparation:**
- Use extracted topics to create topic-based sentiment features
- Consider publisher-specific sentiment patterns
- Analyze sentiment trends around news spikes
- Create time-based sentiment features (hour, day of week)

**2. Feature Engineering:**
- Headline length and word count (already extracted)
- Topic labels from LDA/BERTopic
- Publisher credibility/weighting factors
- Temporal features (time since market open, day of week)
- Stock-specific news frequency

**3. Correlation Analysis:**
- Correlate sentiment scores with stock price movements
- Analyze lag effects (news impact on next day/week prices)
- Study publisher influence on market reactions
- Identify which topics drive price movements

**4. Model Preparation:**
- Create train/validation/test splits respecting temporal order
- Handle class imbalance if predicting price direction
- Consider multi-class classification (up/down/neutral)
- Feature selection based on correlation analysis

**5. Sentiment Scoring Approaches:**
- Use VADER (financial lexicon-aware)
- Fine-tune BERT/RoBERTa on financial news
- Create custom financial sentiment dictionary
- Combine multiple sentiment scores

**6. Validation Strategy:**
- Time-series cross-validation
- Walk-forward analysis
- Out-of-sample testing on recent data
