# QQQ GDELT News Sentiment Scraper

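This notebook collects news articles about the Invesco QQQ ETF from GDELT (the Global Database of Events, Language, and Tone), scores each headline with VADER, and merges the daily sentiment averages with QQQ technical data.
GDELT indexes news from around the world and reaches further back in time than the Reddit data.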

## 1. Install and Import Required Libraries

In [1]:
# Install required packages
%pip install gdeltdoc pandas vaderSentiment requests beautifulsoup4

# Import required libraries
from gdeltdoc import GdeltDoc, Filters
import pandas as pd
from datetime import datetime, timedelta
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import time

Collecting gdeltdoc
  Downloading gdeltdoc-1.12.0-py3-none-any.whl.metadata (7.0 kB)
Collecting typing-extensions>=4.13.0 (from gdeltdoc)
  Using cached typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Downloading gdeltdoc-1.12.0-py3-none-any.whl (17 kB)
Using cached typing_extensions-4.15.0-py3-none-any.whl (44 kB)
Installing collected packages: typing-extensions, gdeltdoc

  Attempting uninstall: typing-extensions

    Found existing installation: typing_extensions 4.12.2

    Uninstalling typing_extensions-4.12.2:





## 2. Define QQQ News Collection Function

In [5]:
def scrape_qqq_news(start_date, end_date, max_records=1000):
    """
    Scrape QQQ-related news from GDELT
    Args:
        start_date (str): Start date in format 'YYYY-MM-DD'
        end_date (str): End date in format 'YYYY-MM-DD'
        max_records (int): Maximum number of unique articles to keep after de-duplication
    Returns:
        DataFrame: Processed news data with sentiment scores
    """
    # Initialize GDELT
    gd = GdeltDoc()
    
    # Define search terms for QQQ
    search_terms = ['QQQ ETF', 'Invesco QQQ', 'NASDAQ-100 ETF', 'QQQ Trust']
    
    all_articles = []
    
    print(f"Collecting QQQ news from {start_date} to {end_date}")
    
    for term in search_terms:
        try:
            print(f"\nSearching for: {term}...")
            
            # Create filters
            f = Filters(
                keyword=term,
                start_date=start_date,
                end_date=end_date
            )
            
            # Fetch matching articles (each query returned at most 250 in the run below)
            articles = gd.article_search(f)
            
            if articles is not None and len(articles) > 0:
                articles['search_term'] = term
                all_articles.append(articles)
                print(f"Found {len(articles)} articles for '{term}'")
            else:
                print(f"No articles found for '{term}'")
            
            # Rate limiting
            time.sleep(2)
            
        except Exception as e:
            print(f"Error searching for '{term}': {str(e)}")
            continue
    
    if not all_articles:
        print("No articles found")
        return pd.DataFrame()
    
    # Combine all articles
    news_df = pd.concat(all_articles, ignore_index=True)
    
    # Remove duplicates based on URL (the same article can match several search terms)
    news_df = news_df.drop_duplicates(subset=['url'], keep='first').reset_index(drop=True)
    
    # Keep at most max_records rows
    news_df = news_df.head(max_records)
    
    print(f"\nTotal unique articles: {len(news_df)}")
    
    return news_df
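
The loop above waits a fixed two seconds between searches and skips a term as soon as any exception is raised. One possible refinement is to retry a failed search with exponential backoff. The sketch below is only an illustration: `with_retries` is a hypothetical helper, not part of gdeltdoc, and the default delays are assumptions rather than documented GDELT limits.

```python
import time


def with_retries(fetch, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Hypothetical helper (not part of gdeltdoc). `sleep` is injectable so the
    waiting can be replaced in tests.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Inside `scrape_qqq_news`, the search call could then be wrapped as `with_retries(lambda: gd.article_search(f))`, keeping the existing `try`/`except` as a last resort.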

## 3. Add Sentiment Analysis

In [3]:
def add_sentiment_scores(df):
    """
    Add VADER sentiment scores to news articles
    """
    analyzer = SentimentIntensityAnalyzer()
    
    def get_sentiment(text):
        if pd.isna(text) or text == '':
            return pd.Series({'compound': 0, 'pos': 0, 'neu': 0, 'neg': 0})
        scores = analyzer.polarity_scores(text)
        return pd.Series(scores)
    
    print("Calculating sentiment scores...")
    
    # Calculate sentiment on title
    sentiment = df['title'].apply(get_sentiment)
    df = pd.concat([df, sentiment], axis=1)
    
    return df
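
The `compound` score is a single value between -1 and 1, which can be hard to read at a glance. A minimal sketch of turning it into a label follows. The function name `label_sentiment` is hypothetical, and the ±0.05 cut-offs are the ones commonly suggested in the VADER documentation, so treat them as a starting point rather than a rule.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label.

    The default threshold is an assumption based on the commonly
    suggested VADER cut-offs; adjust it for your own data.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Applied to a scored frame, this could look like `df['label'] = df['compound'].apply(label_sentiment)`.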

## 4. Execute Data Collection

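Set the date range below. GDELT's archive goes back several years, but each search term returned at most 250 articles in the run shown here, so a long range may be only partly covered.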

In [6]:
# Set date range (you can go back several years with GDELT)
# Example: Last 6 months
end_date = datetime.now()
start_date = end_date - timedelta(days=180)  # 6 months

# Format dates for GDELT
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

print(f"Fetching news from {start_date_str} to {end_date_str}")

# Collect news data
news_df = scrape_qqq_news(start_date_str, end_date_str, max_records=2000)

# Add sentiment scores
if len(news_df) > 0:
    news_df = add_sentiment_scores(news_df)
    
    # Convert seendate to datetime
    news_df['date'] = pd.to_datetime(news_df['seendate'])
    
    print(f"\nCollection complete! Found {len(news_df)} articles")
    print(f"\nSample of collected data:")
    print(news_df[['date', 'title', 'domain', 'compound']].head())
else:
    print("No data collected")

Fetching news from 2025-05-14 to 2025-11-10
Collecting QQQ news from 2025-05-14 to 2025-11-10

Searching for: QQQ ETF...
Found 128 articles for 'QQQ ETF'

Searching for: Invesco QQQ...
Found 250 articles for 'Invesco QQQ'

Searching for: NASDAQ-100 ETF...
Found 76 articles for 'NASDAQ-100 ETF'

Searching for: QQQ Trust...
Found 250 articles for 'QQQ Trust'

Total unique articles: 535
Calculating sentiment scores...

Collection complete! Found 535 articles

Sample of collected data:
                       date  \
0 2025-10-30 08:45:00+00:00   
1 2025-10-19 10:45:00+00:00   
2 2025-09-15 14:15:00+00:00   
3 2025-09-05 14:00:00+00:00   
4 2025-09-10 20:30:00+00:00   

                                               title             domain  \
0  The Stock 
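
The cell above lets `pd.to_datetime` infer the format of `seendate`. If the column holds compact UTC strings such as `20251030T084500Z` (an assumption about the API response, so check your own data), the format can be spelled out explicitly. The sketch below uses only the standard library, and `parse_seendate` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Assumed layout of GDELT's seendate field: YYYYMMDD, 'T', HHMMSS, 'Z' (UTC)
SEENDATE_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_seendate(value):
    """Parse a compact UTC timestamp string into a timezone-aware datetime."""
    return datetime.strptime(value, SEENDATE_FORMAT).replace(tzinfo=timezone.utc)
```

The pandas equivalent would be `pd.to_datetime(news_df['seendate'], format='%Y%m%dT%H%M%SZ', utc=True)`. Because the timestamps are in UTC, daily buckets built from them follow UTC calendar days, not local ones.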

## 5. Aggregate and Save Data

In [7]:
if len(news_df) > 0:
    # Save detailed data
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    detailed_filename = f"QQQ_gdelt_news_{timestamp}.csv"
    news_df.to_csv(detailed_filename, index=False)
    print(f"Detailed data saved to {detailed_filename}")
    
    # Aggregate by date
    daily_sentiment = news_df.copy()
    daily_sentiment['date'] = pd.to_datetime(daily_sentiment['date']).dt.date  # UTC calendar date
    
    daily_summary = daily_sentiment.groupby('date').agg({
        'compound': 'mean',
        'pos': 'mean',
        'neg': 'mean',
        'neu': 'mean',
        'title': 'count'  # Number of articles per day
    }).reset_index()
    
    daily_summary.columns = ['date', 'compound', 'pos', 'neg', 'neu', 'article_count']
    
    # Save aggregated data
    summary_filename = f"QQQ_gdelt_daily_sentiment_{timestamp}.csv"
    daily_summary.to_csv(summary_filename, index=False)
    print(f"Daily summary saved to {summary_filename}")
    
    # Display summary statistics
    print("\n=== Summary Statistics ===")
    print(f"Total articles: {len(news_df)}")
    print(f"Date range: {daily_summary['date'].min()} to {daily_summary['date'].max()}")
    print(f"Average daily sentiment: {daily_summary['compound'].mean():.4f}")
    print(f"Average articles per day: {daily_summary['article_count'].mean():.1f}")
    print("\nDaily sentiment summary:")
    print(daily_summary.describe())

Detailed data saved to QQQ_gdelt_news_20251110_194823.csv
Daily summary saved to QQQ_gdelt_daily_sentiment_20251110_194823.csv

=== Summary Statistics ===
Total articles: 535
Date range: 2025-08-12 to 2025-11-11
Average daily sentiment: 0.1889
Average articles per day: 5.9

Daily sentiment summary:
        compound        pos        neg        neu  article_count
count  91.000000  91.000000  91.000000  91.000000      91.000000
mean    0.188853   0.118765   0.033807   0.847418       5.879121
std     0.179530   0.071819   0.038412   0.074319       3.161698
min    -0.273200   0.000000   0.000000   0.665750       1.000000
25%     0.073233   0.075833   0.000000   0.805167       3.000000
50%     0.196492   0.121615   0.025167   0.847889       6.000000
75%     0.312396   0.152950   0.050556   0.886562       7.000000
max     0.571000   0.334250   0.231000   1.000000      14.000000
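
The aggregation above uses a dict and then renames the columns by position, which silently breaks if the dict order changes. As an alternative sketch, pandas named aggregation sets each output column name next to its rule. The toy `articles` frame below is made-up illustration data, not output from this notebook.

```python
import pandas as pd

# Made-up rows standing in for the scored news_df
articles = pd.DataFrame({
    "date": ["2025-11-10", "2025-11-10", "2025-11-11"],
    "title": ["a", "b", "c"],
    "compound": [0.5, -0.1, 0.2],
})

# Each keyword names an output column: (source column, aggregation)
daily = articles.groupby("date", as_index=False).agg(
    compound=("compound", "mean"),
    article_count=("title", "count"),
)
print(daily)
```

The `pos`, `neg`, and `neu` columns can be added the same way, one keyword per output column.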


## 6. Merge with Technical Data

Left-join the daily GDELT sentiment onto the QQQ technical data by date, so that every trading day is kept even when no articles were found for it.

In [8]:
if len(news_df) > 0:
    # Load technical data
    technical_data = pd.read_csv("QQQ_Historical_DayByDay.csv")
    technical_data['Date'] = pd.to_datetime(technical_data['Date']).map(lambda x: x.date())
    
    # Merge with sentiment data
    merged_data = technical_data.merge(
        daily_summary,
        left_on='Date',
        right_on='date',
        how='left'
    ).drop('date', axis=1)
    
    # Fill days without news with 0 (note: a compound of 0 then looks the same as a neutral day)
    sentiment_columns = ['compound', 'pos', 'neg', 'neu', 'article_count']
    merged_data[sentiment_columns] = merged_data[sentiment_columns].fillna(0)
    
    # Save merged data
    merged_filename = f"QQQ_technical_gdelt_sentiment_{timestamp}.csv"
    merged_data.to_csv(merged_filename, index=False)
    print(f"\nMerged data saved to {merged_filename}")
    
    # Display sample
    print("\nSample of merged data:")
    print(merged_data.head())


Merged data saved to QQQ_technical_gdelt_sentiment_20251110_194823.csv

Sample of merged data:
         Date        Open        High         Low       Close    Volume  \
0  2023-11-03  360.320764  364.538902  360.093546  363.244812  53280500   
1  2023-11-06  364.015365  365.289709  362.454565  364.726624  38848700   
2  2023-11-07  365.773740  369.043544  364.568554  368.174255  50777400   
3  2023-11-08  368.549600  369.251000  366.119481  368.411316  35663400   
4  2023-11-09  369.102828  370.248715  365.082243  365.576172  53859400   

   Dividends  Stock Splits  Capital Gains  Daily_Return  ...  Volume_MA_10  \
0        0.0           0.0            0.0           NaN  ...           NaN   
1        0.0           0.0            0.0      0.004079  ...           NaN   
2        0.0           0.0            0.0      0.009453  ...           NaN   
3        0.0           0.0            0.0      0.000644  ...           NaN   
4        0.0           0.0            0.0     -0.007696  ...   
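
Filling missing sentiment with 0 makes a day without news indistinguishable from a day with neutral news. The technical data here starts in 2023, long before the collected articles, so most rows fall into the first case. One hedged alternative is to keep an explicit flag via the `indicator` option of `merge`. The frames below are made-up illustration data, and `has_news` is a hypothetical column name.

```python
import pandas as pd

# Made-up stand-ins for technical_data and daily_summary
prices = pd.DataFrame({"Date": ["2025-11-07", "2025-11-10"], "Close": [100.0, 101.0]})
sentiment = pd.DataFrame({"date": ["2025-11-10"], "compound": [0.0], "article_count": [3]})

merged = prices.merge(
    sentiment, left_on="Date", right_on="date", how="left", indicator=True
)
# '_merge' is 'both' only where a sentiment row matched the trading day
merged["has_news"] = merged["_merge"] == "both"
merged = merged.drop(columns=["date", "_merge"])

# Counts can safely default to 0; sentiment scores stay NaN when there was no news
merged["article_count"] = merged["article_count"].fillna(0).astype(int)
print(merged)
```

A downstream model can then decide how to treat no-news days instead of inheriting an implicit neutral score.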



## Notes

### GDELT Advantages:
- Archive reaches back several years, further than the Reddit data
- Global news coverage
- Articles in multiple languages
- Indexes financial news outlets alongside general news sources

### Tips:
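- Start with a small date range to test the pipeline end to end
- GDELT has rate limits, so add delays between requests
- Widen the date range to reach further back, but note that each search term returned at most 250 articles in the run above, so long ranges may be only partly covered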
- Consider running multiple date ranges if you need extensive historical data
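
The last tip can be sketched as a small generator that splits a long period into consecutive windows. `date_windows` is a hypothetical helper, and the 30-day default is an arbitrary assumption. Smaller windows make it less likely that a single query hits the per-search cap.

```python
from datetime import date, timedelta


def date_windows(start, end, days=30):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


for window_start, window_end in date_windows(date(2025, 5, 14), date(2025, 11, 10)):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each pair could be passed to `scrape_qqq_news(window_start.isoformat(), window_end.isoformat())`. Adjacent windows share a boundary date, so concatenate the results and drop duplicate URLs once more afterwards.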