# Setting up the analysis

## Getting the data on the sample firms

In [1]:
import glob
import os
from time import sleep

import pandas as pd
from sec_edgar_downloader import Downloader
from tqdm import tqdm
import shutil

import warnings
warnings.filterwarnings("ignore", message="It looks like you're parsing an XML document using an HTML parser")

In [2]:
# places to put files - best practice chapter 2!

os.makedirs("inputs", exist_ok=True)
os.makedirs("10k_files", exist_ok=True)

## Step 1: Get the list of S&P 500 firms

Using a sample of S&P 500 firms is sensible, but it raises two major issues, the first of which we discussed at length in class.

The obvious limitation: what if the relationships between our risk measurements and returns during a pandemic are different for smaller firms outside the S&P 500? This is a valid concern, and worthy of discussion in your results. Do you have an **economic argument** for why your particular risks would be more or less relevant in a pandemic for small firms than for the larger firms in the S&P 500? Depending on your answer, a relationship you find in this sample might be too high or too low relative to the one for smaller firms. The sign of the relationship might even flip.

The second major issue is how we get the list of S&P 500 firms below. This code gets the list of S&P 500 firms **as of today.** So our sample (A) excludes firms that were in the S&P 500 in March 2020 but no longer are and (B) includes firms that weren't in the index then but are now. This could bias our results.
- Perhaps the firms we are erroneously missing (which we know had poor returns) had HIGH risk factors (which is why they did poorly). So excluding them makes it harder to find a risk-return relationship. 
- Perhaps the firms that we are erroneously including (with high returns) had low risk factors (which is why they fared better in the pandemic). So including them makes it easier to find a risk-return relationship. 

Putting those arguments together, the very way I constructed this sample might bias the results, but it depends on the specifics of the "leavers" and "joiners". 
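
To make the "leavers" and "joiners" idea concrete, here is a minimal sketch of how the two groups could be identified if you had a membership list from March 2020 to compare against today's list. The function name `compare_membership` and the tickers are hypothetical and purely illustrative; they are not real index membership, and this notebook does not build a historical list.

```python
def compare_membership(then, now):
    """Split tickers into leavers, joiners, and stayers using set logic."""
    then, now = set(then), set(now)
    return {
        "leavers": sorted(then - now),   # in the index then, not now
        "joiners": sorted(now - then),   # in the index now, not then
        "stayers": sorted(then & now),   # in the index at both dates
    }

# toy tickers, purely illustrative (NOT real index membership)
members_mar2020 = ["AAA", "BBB", "CCC", "DDD"]
members_today = ["BBB", "CCC", "DDD", "EEE"]

changes = compare_membership(members_mar2020, members_today)
print(changes)
```

With real lists, the sample built from today's constituents would contain the "stayers" and "joiners" but none of the "leavers", which is exactly the source of the possible bias described above.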


In [3]:
# get a file with sample firms and info on them
# (somewhat simplistic option!)

sp500_file = 'inputs/sp500_2022.csv'

if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    pd.read_html(url)[0].to_csv(sp500_file,index=False)

# reuse the path variable so the download check and the load stay in sync
sp500 = pd.read_csv(sp500_file)

## Step 2: Download their last 10-K before the pandemic started

Each download took about 4 seconds.

In total, the run took ~42 minutes and downloaded a 10-K for X of the 503 firms.

This code does not attempt to explore or fix why X 10-Ks are missing. Do you know why?

In [4]:
dl = Downloader("10k_files") # all files will go within this folder

In [5]:
# assumption: if we have a zip file, it means we are done with downloads
# so don't download anything

if not os.path.exists('10k_files/10k_files.zip'):

    for firm in tqdm(sp500['Symbol'][:20]):

        firm_folder = "10k_files/sec-edgar-filings/" + firm

        # if I haven't downloaded an HTML for this firm, do so
        if len(glob.glob(firm_folder + '/10-K/*/*.html')) == 0:
            dl.get("10-K", firm, amount=1, after="2022-01-01", before="2022-12-31")

        # no explicit pause needed to be nice to the server:
        # sec_edgar_downloader automatically limits speed

        # we don't need the .txt files. If there is one for this firm, delete it
        for txt_f in glob.glob(firm_folder + '/10-K/*/*.txt'):
            os.remove(txt_f)    

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [01:12<00:00,  3.63s/it]
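
One way to start exploring the missing 10-Ks is to list the firms that have no HTML file after the loop finishes. The sketch below assumes the same folder layout the loop above checks (`<root>/<ticker>/10-K/*/*.html`); the helper name `firms_without_html` is hypothetical, and the demo uses a throwaway folder with made-up tickers rather than real download results.

```python
import glob
import os
import tempfile

def firms_without_html(tickers, root):
    """Return the tickers with no HTML file under root/<ticker>/10-K/*/."""
    missing = []
    for ticker in tickers:
        pattern = os.path.join(root, ticker, "10-K", "*", "*.html")
        if not glob.glob(pattern):
            missing.append(ticker)
    return missing

# demo on a temporary folder: "AAA" has a (fake) filing, "BBB" does not
with tempfile.TemporaryDirectory() as tmp:
    filing_dir = os.path.join(tmp, "AAA", "10-K", "fake-accession-number")
    os.makedirs(filing_dir)
    open(os.path.join(filing_dir, "filing-details.html"), "w").close()
    missing = firms_without_html(["AAA", "BBB"], tmp)

print(missing)
```

In this notebook the call would look something like `firms_without_html(sp500['Symbol'][:20], "10k_files/sec-edgar-filings")`. It only works while the unzipped folder still exists, so run it before the clean-up step.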


## Step 3: Reduce the hard drive space 

_Note: I've made some choices below to ensure the resulting structure is the same as we get from our "ZIPDURING" route. As always, there are multiple ways to achieve the same thing. This is one way._

Don't run this until you are done with downloads. The code below is meant to be run exactly once, and I made that choice explicit: it does nothing unless you set `done_with_downloads` to `True`.

In [7]:
# set to True to run the code below. make sure you are done with downloads first!
# check that your folder has roughly 500 html files, and take the screenshot
# described in the instructions
done_with_downloads = False

if os.path.exists('10k_files/sec-edgar-filings') and \
    not os.path.exists('10k_files/10k_files.zip') and \
    done_with_downloads:
    
    # zip the folder (15GB --> 3GB)
    shutil.make_archive('10k_files', 'zip', '10k_files')
    
    # delete the folder 
    shutil.rmtree('10k_files/sec-edgar-filings')
    
    # put the zip file in the `10k_files` folder
    shutil.move('10k_files.zip', '10k_files/')
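
Because the clean-up deletes the original folder, it can be reassuring to confirm that the archive actually contains the HTML files. The sketch below counts `.html` entries in a zip file using the standard-library `zipfile` module; the helper name `count_html_in_zip` is hypothetical, and the demo builds a tiny throwaway archive rather than reading the real one.

```python
import os
import tempfile
import zipfile

def count_html_in_zip(zip_path):
    """Count the entries in the archive whose names end in .html."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(name.lower().endswith(".html") for name in zf.namelist())

# demo on a temporary archive with one html file and one txt file
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("sec-edgar-filings/AAA/10-K/x/filing-details.html", "<html></html>")
        zf.writestr("sec-edgar-filings/AAA/10-K/x/notes.txt", "ignore me")
    n_html = count_html_in_zip(demo_zip)

print(n_html)
```

For the real archive, the call would be something like `count_html_in_zip('10k_files/10k_files.zip')`, and the result should be close to the number of HTML files you saw in the folder before zipping.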