# Setting up the analysis

## Getting the data on the sample firms

In [1]:
import pandas as pd
import os
from sec_edgar_downloader import Downloader

import warnings
warnings.filterwarnings("ignore", message="It looks like you're parsing an XML document using an HTML parser")

os.makedirs('inputs',exist_ok=True)
os.makedirs('10k_files',exist_ok=True)

In [None]:
# places to put files - best practice chapter 2!

os.makedirs("inputs", exist_ok=True)  # same folder name the later cells read from
os.makedirs("10k_files", exist_ok=True)

## Step 1: Get a list of S&P 500 firms

Using a sample of S&P 500 firms is sensible. Two major points come up, the first of which we discussed in class a lot.

The obvious limitation is: what if the relationships between our measurements and returns are different for smaller firms outside the S&P 500? This is a good concern, and worthy of discussion in your results. Do you have an **economic argument** for why your particular risks would be more, less, or oppositely relevant for small firms? Depending on your answer, a relationship you find in this sample might overstate or understate the relationship for small firms. Maybe the sign of the relationship even flips.

The second major issue is how we get the list of S&P 500 firms below. This code gets the list of S&P 500 firms **as of today.** So our sample (A) excludes firms that were in the S&P 500 in Mar 2022 but no longer are and (B) includes firms that weren't in it before but are now. This could bias our results.
- Counterargument: The firms that joined after Jan 1 2023 were likely pretty close to the inclusion threshold during 2022, so they may not look very different from firms that were already in the index.
- Perhaps the firms we are erroneously missing (which we know had poor returns) had HIGH risk factors (which is why they did poorly). So excluding them makes it harder to find a risk-return relationship.
- Perhaps the firms that we are erroneously including (with high returns) had low risk factors (which is why they fared better in the pandemic). So including them makes it easier to find a risk-return relationship.

Putting those arguments together, the very way I constructed this sample might bias the results, but the direction depends on the specifics of the "leavers" and "joiners".
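
One way to see how much this matters is to list the joiners and leavers explicitly. The sketch below is only an illustration under assumptions: it supposes you have a table of index changes with hypothetical columns `date`, `added`, and `removed`, and it rolls today's member list back to an earlier date by undoing every later change. The tickers and dates are made up.

```python
import pandas as pd

def membership_as_of(current, changes, as_of):
    """Roll today's member list back to `as_of` by undoing later changes."""
    members = set(current)
    later = changes[changes["date"] > pd.Timestamp(as_of)]
    # undo the most recent change first
    for _, row in later.sort_values("date", ascending=False).iterrows():
        if pd.notna(row["added"]):
            members.discard(row["added"])  # joined after as_of: not a member then
        if pd.notna(row["removed"]):
            members.add(row["removed"])    # left after as_of: was a member then
    return members

# toy data, made up for illustration only
current = ["AAA", "BBB", "DDD"]
changes = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2022-09-01"]),
    "added": ["BBB", "DDD"],
    "removed": [None, "CCC"],
})

then = membership_as_of(current, changes, "2022-03-01")
joiners = set(current) - then  # in today's list, but not in the index back then
leavers = then - set(current)  # in the index back then, but missing from today's list
```

With real data, comparing the risk measures and returns of `joiners` and `leavers` against the rest of the sample would show which of the arguments above is more likely to apply.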


In [2]:
# get a file with sample firms and info on them
# (somewhat simplistic option!)

sp500_file = 'inputs/sp500_2022.csv'

if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
    pd.read_html(url)[0].to_csv(sp500_file,index=False)

sp500 = pd.read_csv(sp500_file)
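
Later steps need one identifier per firm. Here is a minimal sketch of pulling tickers and CIKs out of the table, assuming it has columns named `Symbol` and `CIK`. Check `sp500.columns` first, since the page layout can change. The toy frame stands in for `sp500`.

```python
import pandas as pd

# toy stand-in for `sp500`; the column names are an assumption
toy = pd.DataFrame({"Symbol": ["MSFT", "AAPL"], "CIK": [789019, 320193]})

tickers = toy["Symbol"].tolist()

# EDGAR writes CIKs as 10-digit, zero-padded strings
ciks = toy["CIK"].astype(str).str.zfill(10).tolist()

# keep both identifiers together so either one can be used later
ticker_to_cik = dict(zip(tickers, ciks))
```

Keeping the mapping around means the choice between ticker and CIK can be postponed: loop over tickers for readability and look up the CIK when a stable identifier is needed.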

## Step 2: Download their last 10-K during 2022


In [3]:
dl = Downloader("10k_files")

In [24]:
# need to change (EVENTUALLY) `firms` to the full list of tickers
firms = ['MSFT', 'AMZN', 'AAPL']
before = '2020-03-01'
res_path = '10k_files'  # must match the folder passed to Downloader() above
download_type = '10-K'
amount = 1

for firm in firms:

    # plan: check if I've already downloaded this firm's filing
    # if yes, skip

    # where are this firm's filings?
    # os.path.join picks the right path separator on any OS
    tic_res_path = os.path.join(res_path, 'sec-edgar-filings', firm, download_type)

    we_have_files_for_this_firm = (
        os.path.exists(tic_res_path)
        and len(os.listdir(tic_res_path)) >= amount
    )

    # approach 1: if we haven't downloaded it
    if not we_have_files_for_this_firm:
        # reuse the settings defined above instead of repeating the literals
        dl.get(download_type, firm, before=before, amount=amount) # download_details=False?

    # approach 2: if we have downloaded it, skip to next thing in for-loop
    # if <condition>:
    #     continue
    # dl.get()

    #todo delete the txt file!

    #todo don't upload all these files!
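
The first to-do in the loop above could be handled by a small clean-up helper. This is a hedged sketch, not a settled choice about which file to keep: the file names `full-submission.txt` and `filing-details.html` are an assumption about what the downloader writes, so check a downloaded folder before relying on them. The demo runs on a throwaway folder with a placeholder filing ID.

```python
from pathlib import Path
import tempfile

def prune_filings(root, drop_name="full-submission.txt", dry_run=True):
    """Find (and optionally delete) every file called `drop_name` under `root`."""
    hits = sorted(Path(root).rglob(drop_name))
    if not dry_run:
        for path in hits:
            path.unlink()
    return hits

# demo on a throwaway folder so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    # "0000000000-00-000000" is a placeholder, not a real filing ID
    filing = Path(tmp) / "sec-edgar-filings" / "MSFT" / "10-K" / "0000000000-00-000000"
    filing.mkdir(parents=True)
    (filing / "full-submission.txt").write_text("raw filing")
    (filing / "filing-details.html").write_text("<html></html>")

    would_delete = prune_filings(tmp)              # dry run: nothing is removed
    still_there = (filing / "full-submission.txt").exists()
    prune_filings(tmp, dry_run=False)              # now actually delete
    left = sorted(p.name for p in filing.iterdir())
```

The dry run is the default on purpose, so the list of files can be inspected before anything is deleted. For the second to-do, adding the download folder to `.gitignore` is the usual way to keep it out of the repo.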
   
    

Questions today

1. how to download lots of these 10-Ks
1. don't redownload existing filings 
1. boy, these files are big... which file to keep (txt, html, or both?)
    - folder structure ok?
        - sure, some redundancies, but it works out of the box, easy to change
        - GOOD HABIT: keep master doc list...(in this proj: sp500_2022.csv is the master doc list, only one filing per firm!)
    - tic or CIK?
        - tickers are easier to use, but unstable... tickers change!
1. spider issues
    - progress?
    - speed?
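
The skip-if-downloaded and progress questions can be handled together. The sketch below is one possible arrangement, not the only one: `have_filing` and `download_missing` are hypothetical helper names, the folder layout is the `sec-edgar-filings/<ticker>/<form>` structure used in the loop above, and the demo passes in a fake fetcher so no network call is made.

```python
from pathlib import Path
import tempfile

def have_filing(root, ticker, form="10-K", amount=1):
    """True if at least `amount` filings for `ticker` are already on disk."""
    folder = Path(root) / "sec-edgar-filings" / ticker / form
    return folder.is_dir() and sum(1 for _ in folder.iterdir()) >= amount

def download_missing(tickers, root, fetch):
    """Call `fetch(ticker)` only for firms without a filing; report progress."""
    fetched = []
    for i, ticker in enumerate(tickers, start=1):
        if have_filing(root, ticker):
            print(f"[{i}/{len(tickers)}] {ticker}: already have it, skipping")
            continue
        print(f"[{i}/{len(tickers)}] {ticker}: downloading")
        fetch(ticker)
        fetched.append(ticker)
    return fetched

# demo with a fake fetcher and a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    # pretend MSFT was downloaded earlier
    (Path(tmp) / "sec-edgar-filings" / "MSFT" / "10-K" / "placeholder").mkdir(parents=True)
    fetched = download_missing(["MSFT", "AAPL"], tmp, fetch=lambda t: None)
```

In the notebook, `fetch` would wrap the real call, for example `lambda t: dl.get(download_type, t, before=before, amount=amount)`. A progress-bar library such as `tqdm` could replace the `print` calls.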
    