# 2. Feature Generation

This module serves two purposes:
1. For every company and every quarter from 2009Q1 to 2018Q4, identify the CEO, the CFO, and the quarter in which each began their tenure.
2. (See Capstone Project Proposal).

## 2.1 CEO/CFO identification
  
WikiData is not robust enough for this task. Of the first 5 companies in the companies list, only 1 had information on its current CEO, and none had any information on past CEOs or on any CFO.
  
SEC's EDGAR system offers the quarterly filings ("10-Q") in plain text format.
The great thing about the 10-Q forms is that both the CEO and the CFO must certify a legal form as part of the filing. This form always starts with:  
`I, Steven Roth, certify that:`  
Here, Steven Roth is either the CEO or the CFO. Further down in the same section, the officer signs with name and title, which reveals whether the certifier is the CEO or the CFO.  
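
As a rough illustration of the name-matching idea, the sketch below pulls names out of lines shaped like `I, <name>, certify that`. The pattern, the helper name `find_certifying_officers`, and the sample text are assumptions made for this example; real filings may vary in whitespace, line breaks, or name formats (a name containing a comma, such as a "Jr." suffix, would not match), and the title still has to be read from the signature block separately.

```python
import re

# Assumed wording of the certification opener, e.g. "I, Steven Roth, certify that:".
# Real filings may differ, so treat this pattern as a starting point only.
CERTIFY_PATTERN = re.compile(r"\bI,\s+([^,\n]+?),\s+certify\s+that", re.IGNORECASE)


def find_certifying_officers(filing_text):
    """Return the distinct names found in 'I, <name>, certify that' lines, in order."""
    names = []
    for match in CERTIFY_PATTERN.finditer(filing_text):
        name = " ".join(match.group(1).split())  # collapse repeated whitespace
        if name not in names:
            names.append(name)
    return names


# Made-up sample text, for illustration only
sample = """
I, Steven Roth, certify that:
1. I have reviewed this Quarterly Report ...
I, Jane  Doe, certify that:
1. I have reviewed this Quarterly Report ...
"""
print(find_certifying_officers(sample))
```

A 10-Q normally carries one certification per officer, so two distinct names per filing would be the expected result; any other count is worth flagging for manual review.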

Better yet, these 10-Q filings are addressable via a unique identifier, which is listed in the SUB dataframe. This identifier can be used directly in the URL, which yields a .txt file:  
https://www.sec.gov/Archives/edgar/data/3499/000000349918000023/0000003499-18-000023.txt

Summarized:
1. Get list of 10-Q filing codes for each company from the dataframe.
2. Access the URL that yields the 10-Q text file.
3. Use Regex and/or text matching to identify the officers.
4. Use Regex and/or text matching to identify their titles.

This yields:  

| CIK        | Quarter           | CEO  | CFO |
| :------------ |:-------------:| -----:|-----:|
| 00002354   | 2015Q1 | Jim Jones | Tim Bucks |
| 00002354   | 2015Q2      |   Jim Jones | Tim Bucks |
| 00002354 | 2015Q3      | Jane Jackson | Tim Bucks |
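
Once a table of this shape exists, the start of each tenure can be approximated as the first quarter in which an officer appears for a company. The sketch below uses the placeholder rows from the table above; the lowercase column names and the `tenure_start` label are assumptions for this example. Two caveats apply: an officer who was already in place in the first quarter of the data gets that quarter as the start, and an officer who leaves and later returns is counted from the first appearance only.

```python
import pandas as pd

# Placeholder rows copied from the example table above
officers = pd.DataFrame({
    "cik": ["00002354", "00002354", "00002354"],
    "quarter": ["2015Q1", "2015Q2", "2015Q3"],
    "ceo": ["Jim Jones", "Jim Jones", "Jane Jackson"],
    "cfo": ["Tim Bucks", "Tim Bucks", "Tim Bucks"],
})

# Labels of the form YYYYQn sort chronologically as plain strings,
# so sorting and keeping the first row per (company, CEO) gives the first quarter seen
ceo_tenure_start = (
    officers.sort_values(["cik", "quarter"])
    .drop_duplicates(subset=["cik", "ceo"], keep="first")
    [["cik", "ceo", "quarter"]]
    .rename(columns={"quarter": "tenure_start"})
    .reset_index(drop=True)
)
print(ceo_tenure_start)
```

The same steps with `cfo` in place of `ceo` give the CFO tenure starts.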


In [1]:
# to restore pickled files
import pickle

# to work with data
import pandas as pd

Restore the full dictionary of dataframes for maximum data flexibility.

In [2]:
# restore the pickled dictionary of dataframes
try: DataFrames
except NameError:
    with open('dict_of_dfs_num_pre_sub_tag.p', 'rb') as SECfile:
        DataFrames = pickle.load(SECfile)

In [3]:
DataFrames['SUB'][['adsh','cik','name','ein','former','changed','afs','wksi','fye','form','period','filed']].head()

| | adsh | cik | name | ein | former | changed | afs | wksi | fye | form | period | filed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0000002178-18-000067 | 2178 | ADAMS RESOURCES & ENERGY, INC. | 741753147.0 | ADAMS RESOURCES & ENERGY INC | 19920703.0 | 2-ACC | 0 | 1231.0 | 10-Q | 20180930 | 20181107 |
| 1 | 0000002488-18-000189 | 2488 | ADVANCED MICRO DEVICES INC | 941692300.0 | | | 1-LAF | 0 | 1231.0 | 10-Q | 20180930 | 20181031 |
| 2 | 0000002969-18-000044 | 2969 | AIR PRODUCTS & CHEMICALS INC /DE/ | 231274455.0 | | | 1-LAF | 1 | 930.0 | 10-K | 20180930 | 20181120 |
| 3 | 0000003499-18-000023 | 3499 | ALEXANDERS INC | 510100517.0 | | | 1-LAF | 0 | 1231.0 | 10-Q | 20180930 | 20181029 |
| 4 | 0000003545-18-000108 | 3545 | ALICO INC | 590906081.0 | ALICO LAND DEVELOPMENT CO | 19740219.0 | 2-ACC | 0 | 930.0 | 10-K | 20180930 | 20181206 |


Taking a closer look at the Submissions Dataframe:
* The CIK (Central Index Key), zero-padded, forms the leading digits of the ADSH (unique filing identifier) in the rows shown.
* Former company names, along with the change date (as integer), are shown. This could be helpful for further company research.
* 'form' lists the form name, making filtering for 10-Qs easy.
* 'fye' refers to the fiscal year end date.
* 'period' gives the end date of the period that the filing covers (the quarter end for a 10-Q).
* 'filed' refers to the document filing date.  
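
The integer dates in 'period' can be turned into quarter labels such as `2018Q3`, which is the form needed for the CEO/CFO table. The sketch below is one possible way to do this with pandas; the sample values are illustrative, and the result is the calendar quarter of the period end date, which is not necessarily the company's fiscal quarter when 'fye' is not 1231.

```python
import pandas as pd

# Illustrative 'period' values in the integer YYYYMMDD form used by the SUB table
periods = pd.Series([20180930, 20180630, 20150331], name="period")

# Parse the integers as dates, then map each date to its calendar quarter
period_dates = pd.to_datetime(periods.astype(str), format="%Y%m%d")
calendar_quarter = period_dates.dt.to_period("Q").astype(str)

print(calendar_quarter.tolist())
```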

'ein', 'afs' and 'wksi' hold no relevance to our analysis (based on a reading of readme.html).
Retaining only the relevant columns yields:

In [4]:
sub_relevant_columns = DataFrames['SUB'][['adsh','cik','name','former','changed','fye','form','period']]

In [5]:
sub_relevant_columns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6421 entries, 0 to 6420
Data columns (total 8 columns):
adsh       6421 non-null object
cik        6421 non-null int64
name       6421 non-null object
former     3596 non-null object
changed    3596 non-null float64
fye        6419 non-null float64
form       6421 non-null object
period     6421 non-null int64
dtypes: float64(2), int64(2), object(4)
memory usage: 401.4+ KB


Data quality is good overall. Only 2 filings are missing the fiscal year end ('fye'), which is unlikely to matter for the analysis. Steps needed to make the data useful for our needs:  
* Filter the table on 10-Q forms.

In [6]:
filings_to_obtain = sub_relevant_columns[sub_relevant_columns['form'] == '10-Q']

In [7]:
filings_to_obtain.head()

| | adsh | cik | name | former | changed | fye | form | period |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0000002178-18-000067 | 2178 | ADAMS RESOURCES & ENERGY, INC. | ADAMS RESOURCES & ENERGY INC | 19920703.0 | 1231.0 | 10-Q | 20180930 |
| 1 | 0000002488-18-000189 | 2488 | ADVANCED MICRO DEVICES INC | | | 1231.0 | 10-Q | 20180930 |
| 3 | 0000003499-18-000023 | 3499 | ALEXANDERS INC | | | 1231.0 | 10-Q | 20180930 |
| 5 | 0000003570-18-000160 | 3570 | CHENIERE ENERGY INC | BEXY COMMUNICATIONS INC | 19940314.0 | 1231.0 | 10-Q | 20180930 |
| 7 | 0000004281-18-000127 | 4281 | ARCONIC INC. | ALCOA INC. | 20141003.0 | 1231.0 | 10-Q | 20180930 |


In [8]:
# to open the text file
import requests

# to scrape the text file
from bs4 import BeautifulSoup

# to pause the script
from time import sleep
from random import randint

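def download_filing(row):
    """Download the full-text filing for one row of the submissions dataframe."""
    adsh = str(row['adsh'])
    cik = str(row['cik'])
    # the folder name in the URL is the ADSH without its dashes
    adsh_stripped = adsh.replace('-', '')
    url = 'https://www.sec.gov/Archives/edgar/data/' + cik + '/' + adsh_stripped + '/' + adsh + '.txt'
    
    # web scraping best practice to avoid overloading the server
    sleep(randint(2,6))
    # the timeout keeps one stalled request from blocking the whole run
    response = requests.get(url, timeout=60)
    r = response.text
    # report the HTTP status so failed downloads are easy to spot in the log
    print('downloaded ' + url + ' (HTTP ' + str(response.status_code) + ')')
    print('starts with ' + r[:50])
    return r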

### This will be slow - run overnight

# Add a column 'content' in which the filing content is stored    
# filings = filings_to_obtain.copy()
# filings['content'] = filings.apply(download_filing, axis=1)