# Setting up the analysis

## Getting the data on the sample firms

In [1]:
import glob
import os
from time import sleep

import pandas as pd
from sec_edgar_downloader import Downloader
from tqdm import tqdm

import zipfile
import fnmatch

import warnings
warnings.filterwarnings("ignore", message="It looks like you're parsing an XML document using an HTML parser")

In [2]:
# places to put files - best practice chapter 2!

os.makedirs("inputs", exist_ok=True)
os.makedirs("10k_files", exist_ok=True)

## Step 1: Get the URL to the S&P 500 firms

Using a sample of S&P500 firms is sensible. Two major points come up, the first of which we discussed in class a lot.

The obvious limitation: what if the relationships between our risk measurements and returns during a pandemic are different for smaller firms outside the S&P 500? This is a good concern, and worthy of discussion in your results. Do you have an **economic argument** for why your particular risks would be more or less relevant in a pandemic for small firms than for the larger firms in the S&P 500? Depending on your answer, the relationship you estimate in this sample might overstate or understate the relationship for smaller firms. Maybe the sign of the relationship even flips.

The second major issue is how we get the list of S&P 500 firms below. This code gets the list of S&P 500 firms **as of today.** So our sample (A) excludes firms that were in the S&P 500 in March 2020 but no longer are, and (B) includes firms that weren't in the index then but are now. This could bias our results.
- Perhaps the firms we are erroneously missing (which we know had poor returns) had HIGH risk factors (which is why they did poorly). So excluding them makes it harder to find a risk-return relationship. 
- Perhaps the firms that we are erroneously including (with high returns) had low risk factors (which is why they fared better in the pandemic). So including them makes it easier to find a risk-return relationship. 

Putting those arguments together, the very way I constructed this sample might bias the results, but it depends on the specifics of the "leavers" and "joiners". 
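
To make the "leavers" and "joiners" idea concrete, here is a minimal sketch that compares two membership lists with set differences. The tickers and both lists are hypothetical placeholders, not real index memberships; in practice you would need a historical constituent list from some other source.

```python
# Hypothetical tickers, for illustration only (not real index memberships)
members_mar_2020 = {"AAA", "BBB", "CCC", "DDD"}
members_today = {"BBB", "CCC", "DDD", "EEE"}


def leavers_and_joiners(then, now):
    """Return (leavers, joiners) as sorted lists of tickers."""
    # in the index then, but missing from a sample built from today's list
    leavers = sorted(set(then) - set(now))
    # in a sample built from today's list, but not in the index then
    joiners = sorted(set(now) - set(then))
    return leavers, joiners


leavers, joiners = leavers_and_joiners(members_mar_2020, members_today)
print("Excluded in error (leavers):", leavers)
print("Included in error (joiners):", joiners)
```

If both lists were available, the sizes of these two groups would give a first sense of how much the sample construction could matter.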


In [3]:
# get a file with sample firms and info on them
# (somewhat simplistic option!)

sp500_file = 'inputs/sp500_2022.csv'

if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    # the first table on the page is the list of current constituents
    pd.read_html(url)[0].to_csv(sp500_file, index=False)

sp500 = pd.read_csv(sp500_file)

## Step 2: Download their last 10-K before the pandemic started

Each download took about 4 seconds.

In total, the run took ~42 minutes and downloaded a 10-K for X of the 503 firms.

This code here does not attempt to fix or explore why X 10-Ks are missing. Do you know why?

In [4]:
dl = Downloader("10k_files") # all files will go within this folder

In [5]:
# First, check what files we already have in the ZIP folder 
# We will use this in the next block to avoid downloading duplicates
zip_folder_name = '10k_files/10k_files.zip'
if os.path.exists(zip_folder_name):
    with zipfile.ZipFile(zip_folder_name, 'r') as zip:
        # Get a list of all the files in the ZIP folder using the namelist() method
        file_list = zip.namelist()
else:
    file_list = []
    
# Append new files to the existing ZIP folder
with zipfile.ZipFile(zip_folder_name, 'a',zipfile.ZIP_DEFLATED) as zip:

    # Loop over a subset of firms
    for firm in tqdm(sp500['Symbol'][:20]):

        # look in the file_list (from the existing list) for files from this firm
        # note: first folder level inside the zip is sec-edgar-filings
        pattern = 'sec-edgar-filings/'+firm+'/10-K/*/*.html'
        firm_files = fnmatch.filter(file_list, pattern)  # Check if any matching file already exists
        
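        # If we haven't downloaded any HTML files for this firm, do so
        # (note: this date window selects a 10-K filed during 2022; adjust
        # `after` and `before` if you need a different filing period)
        if len(firm_files) == 0:  
            dl.get("10-K", firm, amount=1, after="2022-01-01", before="2022-12-31")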

        # put a pause here to be nice to server 
        # NVM: not needed! sec_edgar_downloader automatically limits speed 

        # Add any new HTML files to the ZIP folder and delete them from the local folder
        for f in glob.glob('10k_files/sec-edgar-filings/'+firm+'/10-K/*/*'):
            
            # note: to match the output of ZIPAFTER, save this to the zip folder
            # with a filepath that starts at sec-edgar-filings (f[10:] drops the
            # leading '10k_files/'). this is hacky and I wouldn't do this EXCEPT I
            # want both routes to have the same intermediate zip structures so
            # future lessons work no matter which route you choose
            zip_f = f[10:]
            
            if zip_f.endswith('.html') and zip_f not in file_list:  
                zip.write(f,zip_f)  
            os.remove(f)  # Delete the file from the local folder


100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [01:15<00:00,  3.78s/it]
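
The "skip if already downloaded" check in the loop above relies on `fnmatch.filter` applied to the ZIP's `namelist()`. The following is a minimal sketch of that check in isolation, using a small in-memory archive. The ticker, accession-style folder name, file names, and the helper `already_downloaded` are all made up for illustration.

```python
import fnmatch
import io
import zipfile

# Build a tiny in-memory archive with made-up entries (hypothetical names)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    folder = "sec-edgar-filings/AAA/10-K/0000000000-22-000001/"
    zf.writestr(folder + "filing-details.html", "<html></html>")
    zf.writestr(folder + "full-submission.txt", "plain text")

# List what the archive already contains
with zipfile.ZipFile(buffer, "r") as zf:
    file_list = zf.namelist()


def already_downloaded(firm, names):
    """True if the archive listing has at least one HTML 10-K file for `firm`."""
    pattern = "sec-edgar-filings/" + firm + "/10-K/*/*.html"
    return len(fnmatch.filter(names, pattern)) > 0


print(already_downloaded("AAA", file_list))  # firm with an HTML file in the archive
print(already_downloaded("BBB", file_list))  # firm with nothing in the archive
```

Because the pattern includes the slashes around the ticker, a ticker that is a prefix of another one (say `A` and `AAA`) should not produce a false match.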


In [6]:
glob.glob('10k_files/sec-edgar-filings/'+firm+'/10-K/*/*')

[]
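
The download loop builds each archive path with the slice `f[10:]`, which only works while the local folder prefix is exactly 10 characters long. One possible alternative, sketched here under the assumption that you are free to change that line, is to compute the path relative to the download folder with `os.path.relpath`. The helper name `archive_name` and the example path are hypothetical.

```python
import os


def archive_name(local_path, root):
    """Path of `local_path` relative to `root`, written with forward slashes."""
    rel = os.path.relpath(local_path, start=root)
    # ZIP entries use forward slashes, so normalize Windows-style separators
    return rel.replace(os.sep, "/")


# Made-up example path, built with the OS-specific separator
example = os.path.join(
    "10k_files", "sec-edgar-filings", "AAA", "10-K",
    "0000000000-22-000001", "filing-details.html",
)
print(archive_name(example, "10k_files"))
```

This keeps the archive layout starting at `sec-edgar-filings` even if the download folder is renamed.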

## Optional clean-up step

Apart from the ZIP file itself, the `10k_files` folder _should_ now be a tree of folders with no files in it. You can check that in your file browser and delete the empty folders manually, but the code below will do it automatically for you. After running it, if there are any folders left, that means they still have files inside them somewhere.

In [7]:
def delete_empty_directories(path):
    # Loop over all files and directories in the current directory
    for entry in os.listdir(path):
        full_path = os.path.join(path, entry)

        # If the entry is a directory, recurse into it
        if os.path.isdir(full_path):
            delete_empty_directories(full_path)

            # If the directory is empty after deleting subdirectories, delete it
            if not os.listdir(full_path):
                os.rmdir(full_path)

# call the recursive function to delete empty directories inside the directory
delete_empty_directories('10k_files')
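
As a point of comparison, the same clean-up can be written without explicit recursion by walking the tree bottom-up with `os.walk(..., topdown=False)`. This is a sketch of an alternative, not a replacement for the function above; the name `remove_empty_dirs` is hypothetical, and the demo runs in a temporary folder so it does not touch your real files.

```python
import os
import tempfile


def remove_empty_dirs(root):
    """Remove empty directories below `root` (never `root` itself)."""
    removed = []
    # topdown=False visits children before their parents, so a parent that
    # only contained empty folders is itself empty by the time we reach it
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        if not os.listdir(dirpath):  # re-check the folder's current contents
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed


# Demo on a throwaway folder tree
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b", "c"))  # chain of empty folders
    os.makedirs(os.path.join(tmp, "keep"))
    with open(os.path.join(tmp, "keep", "file.txt"), "w") as fh:
        fh.write("not empty")

    remove_empty_dirs(tmp)
    remaining = sorted(os.listdir(tmp))
    print(remaining)
```

Any folder that survives the call still contains at least one file somewhere beneath it.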