# Pre-process Market Data

Market data (price and volume time series for each stock) will be used to benchmark the NLP models. This notebook retrieves market data for the time period over which popularity data are available.

In [1]:
import sys
import os
import re
import time

import pandas as pd
import numpy as np 
import yfinance as yf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


### Extract List of Stock Tickers

The tickers available in the Robinhood popularity data provide a natural limit to the analysis. Each ticker available has an associated .csv file. Extract those tickers and store them in a flat .csv file.

In [2]:
ticker_filenames = [file for file in os.walk("../../data/tabular/popularity_export")][0][2]

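# Escape the dot and anchor at the end so only a literal '.csv' extension matches;
# files without that extension are skipped.
prog = re.compile(r'\.csv$')
suffix_starts = {filename: prog.search(filename).start() for filename in ticker_filenames
                 if prog.search(filename)}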
ticker_names  = sorted([filename[:suffix] for filename, suffix in suffix_starts.items()
                        if '_' not in filename])
ticker_df = pd.DataFrame({'Ticker': ticker_names})
ticker_df.to_csv('../../data/tabular/tickers.csv', index=False, header=True)
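
As an alternative to the regular expression, the ticker can be read off each filename with `pathlib`. The following is a minimal sketch rather than the notebook's method: the helper name `tickers_from_directory` is hypothetical, and it assumes (as the cell above does) that filenames containing an underscore are not tickers.

```python
from pathlib import Path


def tickers_from_directory(directory):
    """Return sorted ticker names for the .csv files directly inside directory."""
    tickers = []
    for path in Path(directory).iterdir():
        # Keep only regular files with a literal '.csv' extension.
        if not path.is_file() or path.suffix != '.csv':
            continue
        # Mirror the filter used above: skip filenames containing an underscore.
        if '_' in path.name:
            continue
        # Path.stem is the filename without its final extension.
        tickers.append(path.stem)
    return sorted(tickers)
```

Calling `tickers_from_directory("../../data/tabular/popularity_export")` would then give a list that can be passed to `pd.DataFrame({'Ticker': ...})` as before.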

### Download Close Prices

I'll use the yfinance library to download daily price history (including close prices) from Yahoo Finance for each ticker. yfinance scrapes Yahoo Finance under the hood, so to avoid being rate limited or blacklisted for aggressive scraping I'll add a short delay (1.5 seconds) after each ticker call. Downloading every series takes about two hours. To make the process robust to errors, I'll split it up by the first letter of each ticker.

In [3]:
# Define a column corresponding to the first letter of each ticker.
ticker_df['LeadingLetter'] = ticker_df['Ticker'].str.slice(0, 1)

def get_letter_history(df, 
                       letter, 
                       directory='../../data/tabular/daily_price',
                       start_date=pd.to_datetime('2017-01-01'), 
                       end_date=pd.to_datetime('2020-08-14'),
                       verbose=True):
    """Download daily price history for every ticker in df that starts with letter."""
    
    # Use the DataFrame passed in rather than the global ticker_df.
    tickers = df.loc[df.LeadingLetter == letter, 'Ticker']
    
    if verbose:
        print('Downloading tickers starting with {}...'.format(letter))
    
    # Series.iteritems() was removed in pandas 2.0; iterate over the values directly.
    for ticker in tickers:
        ticker_yf = yf.Ticker(ticker)
        ticker_yf.history(start=start_date, end=end_date).to_csv('{}/{}.csv'.format(directory, ticker))
        if verbose:
            print('\t{}'.format(ticker))
        
        # Pause after each ticker call to avoid being rate limited.
        time.sleep(1.5)
        
    if verbose:
        print('Daily price files for letter {} saved to {}'.format(letter, directory))
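
The throttling and error-robustness described above can also be handled per ticker instead of per letter. The sketch below is an illustration under stated assumptions, not the code used to produce the data: `download_all` and its `fetch`, `delay`, and `sleep` parameters are hypothetical names, and `fetch` stands in for any callable that returns a DataFrame for a ticker, such as `lambda t: yf.Ticker(t).history(start=start_date, end=end_date)`.

```python
import os
import time


def download_all(tickers, fetch, directory, delay=1.5, sleep=time.sleep):
    """Save one CSV per ticker, pausing after each call and collecting failures.

    fetch is a hypothetical callable: fetch(ticker) -> object with .to_csv(path).
    """
    failed = []
    for ticker in tickers:
        path = os.path.join(directory, '{}.csv'.format(ticker))
        # Skip tickers already saved so an interrupted run can be resumed.
        if os.path.exists(path):
            continue
        try:
            fetch(ticker).to_csv(path)
        except Exception as exc:
            # Record the failure and keep going instead of aborting the run.
            failed.append((ticker, repr(exc)))
        # Throttle after every attempted call, successful or not.
        sleep(delay)
    return failed
```

Passing `sleep` as a parameter keeps the function testable without real waits, and the returned list of failures can be retried in a later run.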
        

In [5]:
original_stdout = sys.stdout

# [1:] skips the first leading letter in ticker_df.
for letter in ticker_df.LeadingLetter.unique()[1:]:
    print('Downloading {}'.format(letter))
    
    # save download log to text file
    logfile = '../../log/download log {} - 2021-3-14.txt'.format(letter)
    with open(logfile, 'w') as f:
        sys.stdout = f
        try:
            get_letter_history(ticker_df, letter)
        finally:
            # Restore stdout even if a download raises, so later output is not lost.
            sys.stdout = original_stdout



Downloading B
Downloading C
Downloading D
Downloading E
Downloading F
Downloading G
Downloading H
Downloading I
Downloading J
Downloading K
Downloading L
Downloading M
Downloading N
Downloading O
Downloading P
Downloading Q
Downloading R
Downloading S
Downloading T
Downloading U
Downloading V
Downloading W
Downloading X
Downloading Y
Downloading Z
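
The per-letter log files written by the loop above can also be produced with `contextlib.redirect_stdout` from the standard library, which restores `sys.stdout` automatically when the block exits, even on an exception. A minimal sketch, where `run_with_log` is a hypothetical helper name and `task` is any callable that prints:

```python
import contextlib


def run_with_log(task, logfile, *args, **kwargs):
    """Call task(*args, **kwargs) with everything it prints written to logfile."""
    with open(logfile, 'w') as f:
        # redirect_stdout swaps sys.stdout for f and restores it on exit.
        with contextlib.redirect_stdout(f):
            return task(*args, **kwargs)
```

For example, `run_with_log(get_letter_history, logfile, ticker_df, letter)` would take the place of the manual reassignment of `sys.stdout`.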


In [9]:
posts_2018 = pd.read_csv('../../data/tabular/wsb_posts/WSB Posts 2018.csv')