### sec-utils getting started notebook

sec-utils simplifies downloading SEC filings with an API that handles everything from building master.idx files to iterating over the filings and downloading them in parallel. The EDGAR database hosts terabytes of data, so these jobs can take a considerable amount of time. sec-utils helps keep them running smoothly with logging and support for resuming or restarting downloads.

There are two main ways to use the package. You can run it from the command line to kick off long-running jobs, or, for a more hands-on approach, call it directly from Python. This notebook covers the latter.

##### Notebook Overview:
- [Installing secutils](#install)
- [Load secutils and build FormIDX](#load)
- [Download SEC Files](#download)

##### Installing secutils <a id='install' />
There are two primary ways to install sec-utils: from the Python Package Index (PyPI) or directly from source.

To install from PyPI:
```bash
pip install secutils
```

And to install from source:
```bash
git clone https://github.com/datawrestler/sec-utils && cd sec-utils
conda create --name sec_env python=3.7 pip
conda activate sec_env
pip install -r requirements.txt
pip install -e .
```

##### Loading secutils <a id='load' />

In [1]:
from secutils.edgar import FormIDX

In [2]:
# FormIDX is the main interface to the master index: specify the year and quarter of the filings you want,
# the specific form types, etc.

form = FormIDX(year=2017, quarter=4, form_types=['10-K'])

specified form type not found in master.idx (2017) - (4) - form not found: []


In [3]:
# we can view the index file with:
form.master_index.head()

| Unnamed: 0 | CIK | Company Name | Form Type | Date Filed | Filename | fname |
|---|---|---|---|---|---|---|
| 61 | 1000230 | OPTICAL CABLE CORP | 10-K | 2017-12-20 | edgar/data/1000230/0001437749-17-020936.txt | 0001437749-17-020936.txt |
| 532 | 1001039 | WALT DISNEY CO/ | 10-K | 2017-11-22 | edgar/data/1001039/0001001039-17-000198.txt | 0001001039-17-000198.txt |
| 618 | 1001115 | GEOSPACE TECHNOLOGIES CORP | 10-K | 2017-12-01 | edgar/data/1001115/0001564590-17-024421.txt | 0001564590-17-024421.txt |
| 805 | 1002037 | LEARNING TREE INTERNATIONAL, INC. | 10-K | 2017-12-15 | edgar/data/1002037/0001437749-17-020789.txt | 0001437749-17-020789.txt |
| 854 | 1002517 | Nuance Communications, Inc. | 10-K | 2017-11-29 | edgar/data/1002517/0001002517-17-000040.txt | 0001002517-17-000040.txt |
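
To make the index columns above more concrete, here is a minimal sketch of how a raw master.idx could be located and parsed by hand. It is not sec-utils' own implementation: `master_idx_url`, `parse_master_idx`, and the sample text are hypothetical names and data for illustration, and the sketch assumes EDGAR's quarterly `full-index` layout and the pipe-delimited master.idx format. The last sample row is made up to show the form-type filter.

```python
# Assumption: EDGAR publishes quarterly indexes under this path layout.
FULL_INDEX = "https://www.sec.gov/Archives/edgar/full-index"


def master_idx_url(year, quarter):
    """Build the URL of the master.idx for a given year and quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    return f"{FULL_INDEX}/{year}/QTR{quarter}/master.idx"


# Abbreviated sample: a preamble line, the header row, a dashed separator, then records.
SAMPLE_MASTER_IDX = """\
Description:           Master Index of EDGAR Dissemination Feed

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000230|OPTICAL CABLE CORP|10-K|2017-12-20|edgar/data/1000230/0001437749-17-020936.txt
1001039|WALT DISNEY CO/|10-K|2017-11-22|edgar/data/1001039/0001001039-17-000198.txt
9999999|EXAMPLE CO (made-up row)|8-K|2017-10-02|edgar/data/9999999/0000000000-17-000001.txt
"""


def parse_master_idx(text, form_types=None):
    """Return one dict per record, optionally keeping only the given form types."""
    lines = text.splitlines()
    # The header row is the first line that starts with the CIK column name.
    start = next(i for i, line in enumerate(lines) if line.startswith("CIK|"))
    header = lines[start].split("|")
    rows = []
    for line in lines[start + 1:]:
        stripped = line.strip()
        # Skip blank lines and the dashed separator under the header.
        if not stripped or set(stripped) == {"-"}:
            continue
        record = dict(zip(header, line.split("|")))
        if form_types is None or record["Form Type"] in form_types:
            rows.append(record)
    return rows


rows = parse_master_idx(SAMPLE_MASTER_IDX, form_types={"10-K"})
for row in rows:
    print(row["CIK"], row["Company Name"], row["Filename"])
```

Only the two 10-K rows survive the filter, which mirrors what passing `form_types=['10-K']` to `FormIDX` is described as doing.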


In [6]:
print(f"Year: {form.year} -- Qtr: {form.quarter} -- Total companies: {form.master_index.CIK.nunique()} -- Total files: {form.master_index.shape[0]}")

Year: 2017 -- Qtr: 4 -- Total companies: 504 -- Total files: 515


##### Download SEC files <a id='download' />

In [8]:
# downloading SEC files is simple: using a prebuilt form, we can convert the master.idx into files
files = form.index_to_files()

In [21]:
# each file has basic attributes like form_type, company_name, download_url, etc.
ex = files[0]
msg = f"""
      Company Name: {ex.company_name}
      CIK Number: {ex.cik_number}
      Date Filed: {ex.date_filed}
      Form Type: {ex.form_type}
      File Name: {ex.file_name}
      Download URL: {ex.file_download_url}
      """

print(msg)


      Company Name: OPTICAL CABLE CORP
      CIK Number: 1000230
      Date Filed: 2017-12-20 00:00:00
      Form Type: 10-K
      File Name: 0001437749-17-020936.txt
      Download URL: https://www.sec.gov/Archives/edgar/data/1000230/0001437749-17-020936.txt
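
The download URL and file name printed above line up with the `Filename` column of the master index. The sketch below shows that relationship; `build_download_url` and `local_file_name` are hypothetical helpers written for this illustration, not part of the secutils API, and the base URL is inferred from the output shown.

```python
import posixpath

# Assumption: index paths are relative to the EDGAR Archives root, as the output above suggests.
EDGAR_ARCHIVES = "https://www.sec.gov/Archives/"


def build_download_url(index_filename):
    """Prefix a master.idx 'Filename' value with the Archives root."""
    return EDGAR_ARCHIVES + index_filename.lstrip("/")


def local_file_name(index_filename):
    """Keep only the last path component, matching the 'fname' column."""
    return posixpath.basename(index_filename)


index_filename = "edgar/data/1000230/0001437749-17-020936.txt"
url = build_download_url(index_filename)
name = local_file_name(index_filename)
print(url)   # https://www.sec.gov/Archives/edgar/data/1000230/0001437749-17-020936.txt
print(name)  # 0001437749-17-020936.txt
```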
      


In [12]:
# to download our example file:
output_dir = '.'
ex.download_file(output_dir) # returns the status code as a string; '200' means the download succeeded

'200'

In [16]:
import os
# list the .txt files now present in the output directory
[name for name in os.listdir(output_dir) if name.endswith('txt')]

['0001437749-17-020936.txt']

In [18]:
# to confirm this is indeed our file:
ex.file_name

'0001437749-17-020936.txt'

##### Closing remarks

Getting hands-on is great, but the CLI provides several advantages:
- Automatic directory structure creation
- Built-in logging and caching
- Ability to resume downloads by scanning for files that were already downloaded
- Multi-threaded file downloading
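
To give a feel for the resume and multi-threading points in the list above, here is a minimal sketch using only the standard library. It does not reflect how the sec-utils CLI is actually implemented: `pending_downloads`, `download_all`, and `fake_fetch` are hypothetical names, and `fake_fetch` writes a placeholder file instead of contacting the SEC so the example runs offline.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def pending_downloads(file_names, output_dir):
    """Return the names not yet present in output_dir (the 'resume' step)."""
    already_there = set(os.listdir(output_dir))
    return [name for name in file_names if name not in already_there]


def download_all(file_names, output_dir, fetch, max_workers=4):
    """Fetch every pending file on a thread pool and return {name: status}."""
    todo = pending_downloads(file_names, output_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(lambda name: fetch(name, output_dir), todo))
    return dict(zip(todo, statuses))


def fake_fetch(name, output_dir):
    """Stand-in for a real download: writes a placeholder file and returns '200'."""
    with open(os.path.join(output_dir, name), "w") as handle:
        handle.write("placeholder")
    return "200"


with tempfile.TemporaryDirectory() as tmp:
    names = ["0001437749-17-020936.txt", "0001001039-17-000198.txt"]
    # Pretend the first file was already downloaded by an earlier, interrupted run.
    fake_fetch(names[0], tmp)
    first_run = download_all(names, tmp, fake_fetch)
    second_run = download_all(names, tmp, fake_fetch)

print(first_run)   # only the file that was missing is fetched
print(second_run)  # nothing left to do on a rerun
```

In a real job, `fetch` would be replaced by something like the `download_file` method shown earlier, and the worker count would be chosen with the SEC's request-rate limits in mind.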