# Capstone Data Collection Notebook

This notebook collects and prepares data for a capstone project focused on forecasting short- to medium-term price movements of the 'Magnificent 7' stocks using LSTM and XGBoost models.

## Step 2: Define Tickers
Define the list of 'Magnificent 7' tickers.

In [20]:
# Define the Magnificent 7 tickers (MELI is not part of this list)
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'TSLA', 'NVDA']
print('Tickers:', tickers)

Tickers: ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'TSLA', 'NVDA']


## Step 3: Download Historical Stock Price and Volume Data
Download daily OHLCV data for all tickers from 2020-01-01 to today using yfinance.

In [39]:
import yfinance as yf
import pandas as pd
from datetime import datetime

# Same tickers as Step 2, repeated so this cell can run on its own
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'TSLA', 'NVDA']

# Set date range
start_date = '2020-01-01'
end_date = datetime.today().strftime('%Y-%m-%d')

# Initialize empty DataFrame
combined_df = pd.DataFrame()

for ticker in tickers:
    df = yf.download(ticker, start=start_date, end=end_date, progress=False)
    # If columns are MultiIndex, flatten to just the first level (e.g., 'Close', not ('Close', 'AAPL'))
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    df = df.reset_index()
    df['ticker'] = ticker
    if 'Date' in df.columns:
        df.rename(columns={'Date': 'date'}, inplace=True)
    # Append to combined_df
    combined_df = pd.concat([combined_df, df], ignore_index=True)

# Reorder columns: date, ticker, then the rest
remaining_cols = [col for col in combined_df.columns if col not in ['date', 'ticker']]
combined_df = combined_df[['date', 'ticker'] + remaining_cols]

# Sort and save
combined_df = combined_df.sort_values(['date', 'ticker']).reset_index(drop=True)
combined_df.to_csv('magnificent7_price_data.csv', index=False)

# Show sample
print(combined_df.head())


Price       date ticker       Close        High         Low        Open     Volume
0     2020-01-02   AAPL   72.620857   72.681304   71.373233   71.627107  135480400
1     2020-01-02   AMZN   94.900497   94.900497   93.207497   93.750000   80580000
2     2020-01-02  GOOGL   68.026016   68.026016   66.923134   67.018562   27278000
3     2020-01-02   META  208.635422  208.645362  205.144578  205.621955   12077100
4     2020-01-02   MSFT  153.042358  153.147169  150.860401  151.289168   22622100
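The long-format table above can also be built by collecting the per-ticker frames in a list and concatenating once, which avoids re-copying a growing DataFrame on every loop iteration. The sketch below is an illustration on made-up data, not the notebook's code: `to_long_format` and the synthetic frames are hypothetical stand-ins for what `yf.download` returns, and the duplicate check is an added sanity test.

```python
import pandas as pd

def to_long_format(frames_by_ticker):
    """Stack per-ticker OHLCV frames into one long table keyed by (date, ticker)."""
    parts = []
    for ticker, df in frames_by_ticker.items():
        part = df.reset_index().rename(columns={"Date": "date"})
        part["ticker"] = ticker
        parts.append(part)
    # One concat at the end instead of one per ticker.
    combined = pd.concat(parts, ignore_index=True)
    # Each (date, ticker) pair should appear exactly once.
    if combined.duplicated(["date", "ticker"]).any():
        raise ValueError("duplicate (date, ticker) rows found")
    lead = ["date", "ticker"]
    rest = [c for c in combined.columns if c not in lead]
    return combined[lead + rest].sort_values(lead).reset_index(drop=True)

# Synthetic stand-ins for downloaded frames (values are made up).
dates = pd.DatetimeIndex(["2020-01-02", "2020-01-03"], name="Date")
frames = {
    "MSFT": pd.DataFrame({"Close": [10.0, 11.0], "Volume": [100, 110]}, index=dates),
    "AAPL": pd.DataFrame({"Close": [20.0, 21.0], "Volume": [200, 210]}, index=dates),
}
long_df = to_long_format(frames)
print(long_df)
```

With real data, `frames` would map each ticker to its downloaded frame after the MultiIndex columns have been flattened.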


## Step 4: Collect Reddit Sentiment Posts
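Collect Reddit posts mentioning any of the tickers from a set of finance-related subreddits (r/stocks, r/wallstreetbets, and others), covering roughly the last three years. Historical posts are requested from the Pushshift API; recent posts are pulled from the Reddit API via PRAW.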
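The PRAW filter in the cell below tests `ticker in title` on the upper-cased title, so `META` also matches a title that contains "METAVERSE", and each post is attributed only to the first ticker found. A minimal sketch of a stricter matcher follows, assuming whole-token matches with an optional leading `$` are what is wanted. `TICKER_PATTERN` and `find_tickers` are illustrative names, not part of the notebook.

```python
import re

TICKERS = ["AAPL", "MSFT", "GOOGL", "AMZN", "META", "TSLA", "NVDA"]

# Lookarounds require that the ticker is not glued to other letters or digits,
# so "META" does not match inside "METAVERSE". The optional "$" covers
# cashtags such as "$AAPL".
TICKER_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])\$?(" + "|".join(TICKERS) + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)

def find_tickers(title):
    """Return the distinct tickers mentioned in a title, in order of appearance."""
    seen = []
    for match in TICKER_PATTERN.finditer(title):
        ticker = match.group(1).upper()
        if ticker not in seen:
            seen.append(ticker)
    return seen

print(find_tickers("Is the metaverse dead?"))
print(find_tickers("$AAPL and msft earnings"))
print(find_tickers("TSLA, TSLA, NVDA!"))
```

Case-insensitive matching still treats a standalone word "meta" as the ticker, so some false positives remain; dropping `re.IGNORECASE` trades those for missed lower-case mentions.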

In [22]:
import os
import praw
import pandas as pd
from datetime import datetime, timedelta, timezone
import requests

# PRAW credentials: read from environment variables instead of hard-coding secrets.
# The variable names are placeholders; use whatever names your environment defines.
client_id = os.environ['REDDIT_CLIENT_ID']
client_secret = os.environ['REDDIT_CLIENT_SECRET']
user_agent = 'CapstoneSentiment by lezama73'

reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

# Expanded list of subreddits for more coverage
subreddits = [
    'stocks',
    'wallstreetbets',
    'investing',
    'StockMarket',
    'options',
    'RobinHood',
    'securityanalysis',
    'algotrading',
    'technology',
    'TSLA',
    'NVDA_Stock',
    'NvidiaStock',
    'AAPL'
]
tickers_upper = [t.upper() for t in tickers]

# Set time window for Pushshift (3 years)
days_back = 3 * 365
max_results = 100000  # Max results to fetch from Pushshift
since = datetime.now(timezone.utc) - timedelta(days=days_back)
after = int(since.timestamp())

def fetch_pushshift(subreddit, tickers, after, before=None, max_results=max_results):
    url = 'https://api.pushshift.io/reddit/search/submission/'
    all_posts = []
    for ticker in tickers:
        params = {
            'subreddit': subreddit,
            'q': ticker,
            'after': after,
            'size': 500,
            'sort': 'desc',
            'sort_type': 'created_utc'
        }
        if before:
            params['before'] = before
        count = 0
        while count < max_results:
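            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code != 200:
                # Report the failure instead of stopping silently
                print(f"Pushshift request failed for {subreddit} {ticker}: HTTP {resp.status_code}")
                break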
            data = resp.json().get('data', [])
            if not data:
                break
            for post in data:
                all_posts.append({
                    'date': datetime.fromtimestamp(post['created_utc'], tz=timezone.utc).strftime('%Y-%m-%d'),
                    'ticker': ticker,
                    'title': post.get('title', ''),
                    'score': post.get('score', 0),
                    'num_comments': post.get('num_comments', 0)
                })
            count += len(data)
            if len(data) < 500:
                break
            params['before'] = data[-1]['created_utc']
        print(f"Pushshift: {subreddit} {ticker} posts collected: {count}")
    return all_posts

# Collect posts from Pushshift
pushshift_posts = []
for subreddit in subreddits:
    pushshift_posts.extend(fetch_pushshift(subreddit, tickers_upper, after))

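# Collect recent posts from PRAW (same cutoff as the Pushshift window).
# Reddit listings return at most about 1,000 items per subreddit, so a larger limit has no extra effect.
recent_since = datetime.now(timezone.utc) - timedelta(days=days_back)
praw_posts = []
for subreddit in subreddits:
    for submission in reddit.subreddit(subreddit).new(limit=max_results):
        post_date = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
        if post_date < recent_since:
            break  # .new() is newest-first, so every later item is older too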
        title = submission.title.upper()
        for ticker in tickers_upper:
            if ticker in title:
                praw_posts.append({
                    'date': post_date.strftime('%Y-%m-%d'),
                    'ticker': ticker,
                    'title': submission.title,
                    'score': submission.score,
                    'num_comments': submission.num_comments
                })
                break

# Combine and deduplicate posts
all_posts = pushshift_posts + praw_posts
reddit_df = pd.DataFrame(all_posts).drop_duplicates(['date','ticker','title'])
print(f"Total Reddit posts collected: {len(reddit_df)}")
reddit_df.to_csv('reddit_sentiment_posts.csv', index=False)

KeyboardInterrupt: 
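The paging loop in `fetch_pushshift` walks backwards through time by passing the oldest timestamp seen so far as the next `before` value. The sketch below isolates that pattern so it can be tested without network access. It is an illustration under assumptions: `paginate`, `fetch_page`, and the fake data are hypothetical, and unlike the loop above it trims the last request so the total never exceeds `max_results`. The public Pushshift endpoint may also no longer be reachable, which is one more reason to keep the paging logic separate from the HTTP call.

```python
def paginate(fetch_page, page_size, max_results):
    """Collect up to max_results items by walking backwards through time.

    fetch_page(before, size) is a caller-supplied function (hypothetical here)
    returning items with a 'created_utc' field, newest first, all strictly
    older than `before` (or the newest items when `before` is None).
    """
    items = []
    before = None
    while len(items) < max_results:
        size = min(page_size, max_results - len(items))
        page = fetch_page(before, size)
        if not page:
            break
        items.extend(page)
        if len(page) < size:
            break  # a short page means there is nothing older left
        before = page[-1]["created_utc"]  # cursor: oldest timestamp seen so far
    return items


# Stand-in for an HTTP call: 12 fake posts with timestamps 12, 11, ..., 1.
FAKE_POSTS = [{"created_utc": t} for t in range(12, 0, -1)]

def fake_fetch(before, size):
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]

posts = paginate(fake_fetch, page_size=5, max_results=11)
print(len(posts), posts[0]["created_utc"], posts[-1]["created_utc"])
```

Posts that share a timestamp with the cursor can be skipped by a strict `before` comparison; deduplicating on a post ID and using an inclusive cursor is one way around that.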

## Step 5: Download and Save Macroeconomic Data from FRED
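Download key U.S. macroeconomic indicators from FRED (via pandas-datareader) for the same date range as the stock data (2020-01-01 to today). The series mix daily, monthly, and quarterly frequencies, so most columns are empty on any given row until they are forward-filled.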

In [40]:
# Download and save macroeconomic data from FRED
import pandas_datareader.data as web
from datetime import datetime

# Define FRED indicators and their codes
fred_series = {
    'GDP': 'GDP',  # US Gross Domestic Product (quarterly)
    'UNRATE': 'UNRATE',  # Unemployment Rate (monthly)
    'CPI': 'CPIAUCSL',  # Consumer Price Index (monthly)
    'FEDFUNDS': 'FEDFUNDS',  # Effective Federal Funds Rate (monthly)
    'DGS10': 'DGS10',  # 10-Year Treasury Constant Maturity yield (daily)
    'GS10': 'GS10',    # 10-Year Treasury Constant Maturity yield (monthly average)
    'M2': 'M2SL',  # M2 Money Stock (monthly)
    'SP500': 'SP500',  # S&P 500 Index (daily, for reference)
    'VIXCLS': 'VIXCLS'  # CBOE Volatility Index (daily)
}

start_date = '2020-01-01'
end_date = datetime.today().strftime('%Y-%m-%d')

macro_df = pd.DataFrame()
for name, code in fred_series.items():
    try:
        data = web.DataReader(code, 'fred', start_date, end_date)
        data = data.rename(columns={code: name})
        macro_df = pd.concat([macro_df, data], axis=1)
    except Exception as e:
        print(f"Could not download {name} ({code}):", e)

macro_df.index.name = 'Date'
macro_df.to_csv('us_macro_data_fred.csv')
print('Macroeconomic data saved to us_macro_data_fred.csv')
macro_df.tail()

Macroeconomic data saved to us_macro_data_fred.csv


Date,GDP,UNRATE,CPI,FEDFUNDS,DGS10,GS10,M2,SP500,VIXCLS
2025-07-28,,,,,4.42,,,6389.77,15.03
2025-07-29,,,,,4.34,,,6370.86,15.98
2025-07-30,,,,,4.38,,,6362.9,15.48
2025-07-31,,,,,4.37,,,6339.39,16.72
2025-08-01,,,,,,,,6238.01,


In [41]:
# Forward-fill missing macroeconomic data with the most recent previous value
macro_df_ffill = macro_df.ffill()
macro_df_ffill.index.name = 'Date'
macro_df_ffill.to_csv('us_macro_data_fred.csv')
print('Macroeconomic data (forward-filled) saved to us_macro_data_fred.csv')
macro_df_ffill.tail()

Macroeconomic data (forward-filled) saved to us_macro_data_fred.csv


Date,GDP,UNRATE,CPI,FEDFUNDS,DGS10,GS10,M2,SP500,VIXCLS
2025-07-28,30331.117,4.2,321.5,4.33,4.42,4.39,22020.8,6389.77,15.03
2025-07-29,30331.117,4.2,321.5,4.33,4.34,4.39,22020.8,6370.86,15.98
2025-07-30,30331.117,4.2,321.5,4.33,4.38,4.39,22020.8,6362.9,15.48
2025-07-31,30331.117,4.2,321.5,4.33,4.37,4.39,22020.8,6339.39,16.72
2025-08-01,30331.117,4.2,321.5,4.33,4.37,4.39,22020.8,6238.01,16.72
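Forward-filling by observation date makes each monthly or quarterly value visible from the first day of its reference period, which is earlier than the figure was actually published. For a forecasting model that can leak future information into the features. The sketch below contrasts the naive alignment with one that shifts each value by a release lag before filling. The data are made up and the 35-day lag is a placeholder assumption, not an actual publication schedule.

```python
import pandas as pd

# Made-up monthly observations, dated the way FRED dates them
# (first day of the reference month).
monthly = pd.Series(
    [4.1, 4.2],
    index=pd.to_datetime(["2025-05-01", "2025-06-01"]),
    name="UNRATE",
)

# Daily index to align onto (business days only).
days = pd.bdate_range("2025-06-02", "2025-07-10")

# Naive alignment: the June value is visible from 2025-06-02 onwards.
naive = monthly.reindex(monthly.index.union(days)).ffill().reindex(days)

# Lagged alignment: assume (hypothetically) that each value becomes public
# 35 days after its reference date, and only then forward-fill.
RELEASE_LAG = pd.Timedelta(days=35)
released = monthly.copy()
released.index = released.index + RELEASE_LAG
lagged = released.reindex(released.index.union(days)).ffill().reindex(days)

check_day = pd.Timestamp("2025-06-10")
print(naive.loc[check_day], lagged.loc[check_day])
```

On 2025-06-10 the naive series already shows the June value, while the lagged series still shows May. A per-series lag, or vintage data with real release dates, would be more faithful than a single constant.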


In [42]:
# Preview price data
print('Price Data Preview:')
display(combined_df.tail(100))

# Preview Reddit sentiment data
print('Reddit Sentiment Data Preview:')
display(reddit_df.tail(100))

# Preview macroeconomic data
print('Macroeconomic Data Preview:')
display(macro_df_ffill.tail(100))

import os
print('magnificent7_price_data.csv exists:', os.path.exists('magnificent7_price_data.csv'))
print('reddit_sentiment_posts.csv exists:', os.path.exists('reddit_sentiment_posts.csv'))
print('us_macro_data_fred.csv exists:', os.path.exists('us_macro_data_fred.csv'))

Price Data Preview:


Price,date,ticker,Close,High,Low,Open,Volume
9721,2025-07-14,NVDA,164.070007,165.490005,162.020004,165.369995,136975800
9722,2025-07-14,TSLA,316.899994,322.600006,312.670013,317.730011,78043400
9723,2025-07-15,AAPL,209.110001,211.889999,208.919998,209.220001,42296300
9724,2025-07-15,AMZN,226.350006,227.270004,225.460007,226.199997,34907300
9725,2025-07-15,GOOGL,182.000000,184.220001,181.600006,182.809998,33448300
...,...,...,...,...,...,...,...
9816,2025-08-01,GOOGL,189.130005,190.830002,187.820007,189.029999,34797400
9817,2025-08-01,META,750.010010,766.000000,745.309998,760.729980,18991000
9818,2025-08-01,MSFT,524.109985,535.799988,520.859985,535.000000,28955600
9819,2025-08-01,NVDA,173.720001,176.539993,170.889999,174.089996,203851100


Reddit Sentiment Data Preview:


,date,ticker,title,score,num_comments
964,2024-07-03,AAPL,Final trade update: AAPL hits new ATH! Have a ...,16,9
965,2024-07-02,AAPL,AAPL - testing resistance again. Can it break ...,0,4
966,2024-06-29,AAPL,$AAPL stock price at close Friday 06/28/24,4,6
967,2024-06-28,AAPL,Closing chart AAPL June28 212 1/2 calls,1,1
968,2024-06-25,AAPL,Trade recap: long AAPL call options,1,0
...,...,...,...,...,...
1059,2022-10-17,META,Tim on to something? I think the meta verse as...,11,0
1060,2022-10-13,AAPL,A graph showing how much $AAPL you need to buy...,14,13
1061,2022-09-23,AAPL,"SPRO, BBBY, TSLA, LITM, AMD, MSFT, AAPL, SAVA,...",1,1
1062,2022-09-18,AAPL,I’m new to the market. I feel I should invest ...,5,11


Macroeconomic Data Preview:


Date,GDP,UNRATE,CPI,FEDFUNDS,DGS10,GS10,M2,SP500,VIXCLS
2025-03-18,29962.047,4.2,319.615,4.33,4.29,4.28,21656.9,5614.66,21.70
2025-03-19,29962.047,4.2,319.615,4.33,4.25,4.28,21656.9,5675.29,19.90
2025-03-20,29962.047,4.2,319.615,4.33,4.24,4.28,21656.9,5662.89,19.80
2025-03-21,29962.047,4.2,319.615,4.33,4.25,4.28,21656.9,5667.56,19.28
2025-03-24,29962.047,4.2,319.615,4.33,4.34,4.28,21656.9,5767.57,17.48
...,...,...,...,...,...,...,...,...,...
2025-07-28,30331.117,4.2,321.500,4.33,4.42,4.39,22020.8,6389.77,15.03
2025-07-29,30331.117,4.2,321.500,4.33,4.34,4.39,22020.8,6370.86,15.98
2025-07-30,30331.117,4.2,321.500,4.33,4.38,4.39,22020.8,6362.90,15.48
2025-07-31,30331.117,4.2,321.500,4.33,4.37,4.39,22020.8,6339.39,16.72


magnificent7_price_data.csv exists: True
reddit_sentiment_posts.csv exists: True
us_macro_data_fred.csv exists: True
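The three saved files share a date key, and the price and Reddit files also share a ticker key, so one plausible next step is a single modelling table with one row per (date, ticker). The sketch below shows one way to do that on made-up data. The aggregation choices (`post_count`, `total_score`) and the zero-fill for days without posts are assumptions, not something the notebook specifies.

```python
import pandas as pd

# Tiny synthetic stand-ins for the three datasets (values are made up).
prices = pd.DataFrame({
    "date": pd.to_datetime(["2025-07-31", "2025-07-31", "2025-08-01"]),
    "ticker": ["AAPL", "MSFT", "AAPL"],
    "Close": [1.0, 2.0, 3.0],
})
posts = pd.DataFrame({
    "date": pd.to_datetime(["2025-07-31", "2025-07-31", "2025-08-01"]),
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "score": [5, 7, 1],
})
macro = pd.DataFrame(
    {"VIXCLS": [16.0, 17.0]},
    index=pd.to_datetime(["2025-07-31", "2025-08-01"]).rename("date"),
)

# Collapse posts to one row per (date, ticker).
daily_posts = (
    posts.groupby(["date", "ticker"], as_index=False)
    .agg(post_count=("score", "count"), total_score=("score", "sum"))
)

# Left joins keep every price row; macro values repeat across tickers.
merged = (
    prices.merge(daily_posts, on=["date", "ticker"], how="left")
    .merge(macro.reset_index(), on="date", how="left")
)

# Days with no matching posts get zero counts rather than NaN.
post_cols = ["post_count", "total_score"]
merged[post_cols] = merged[post_cols].fillna(0)
print(merged)
```

When loading the real CSV files, the date columns would need to be parsed as datetimes first (for example with `parse_dates` in `pd.read_csv`), otherwise the join keys will not line up.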
