# 2. Feature Generation

This module has a single purpose:

1. For every company, and for each quarter from 2009Q1 to 2018Q4, identify the CEO and CFO active in that quarter.

## 2.1 CEO/CFO identification
  
WikiData is not robust enough for this task. Of the first 5 companies in the companies list, only 1 had information on its current CEO, and none had any information on past CEOs or on CFOs.

The CEO and the CFO are each required to sign a legal certification as part of the filing. These certifications are filed as EXHIBIT 31.1 and EXHIBIT 31.2. Each one starts with a line such as `I, Steven Roth, certify that:`, where the named person (here Steven Roth) is either the CEO or the CFO.  
Further down in the exhibit the officer signs with name and title, which tells us whether the signer is the CEO or the CFO.  
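As a small worked example of this name/title extraction, the sketch below applies the same pattern that `extract_names_titles` uses later in this notebook to a made-up exhibit text. The sample text, the name Jane Doe, and the variable names `OFFICER_REGEX`, `SAMPLE_EXHIBITS` and `officers` are illustrative assumptions, not data from a real filing.

```python
import re

# Pattern copied from extract_names_titles further down in this notebook.
OFFICER_REGEX = r'I,.(.+?),?.certify.that:.+?(Chief.+?Officer|(?<!Vice )President(?! and Chief))'

# Made-up text imitating two certifications (EX-31.1 and EX-31.2) in one string.
SAMPLE_EXHIBITS = (
    "I, Steven Roth, certify that:\n"
    "1. I have reviewed this annual report on Form 10-K;\n"
    "/s/ Steven Roth\n"
    "President and Chief Executive Officer\n"
    "I, Jane Doe, certify that:\n"
    "1. I have reviewed this annual report on Form 10-K;\n"
    "/s/ Jane Doe\n"
    "Senior Vice President and Chief Financial Officer\n"
)

# re.DOTALL lets '.' match newlines, so the title can sit many lines below the name.
officers = re.findall(OFFICER_REGEX, SAMPLE_EXHIBITS, re.DOTALL)

for name, title in officers:
    print(name, '->', title)
```

The two lookarounds skip "President" when it is part of "Vice President" or is followed by " and Chief", so in both sample signatures the title captured is the "Chief ... Officer" one.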

These exhibits can be accessed directly, without obtaining and parsing the full statement.
They are linked from a 'Filing Detail' page with a structured URL,  
e.g. https://www.sec.gov/Archives/edgar/data/**18498/000001849818000048/0000018498-18-000048**-index.htm  
Note that the structured portion of the URL (in bold) is made up of the CIK and the ADSH.
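To make the URL structure concrete, here is a minimal sketch that splits a Filing Detail URL back into its CIK and ADSH. The function name `split_filing_detail_url` is hypothetical, and the digit layout assumed for the ADSH (10, 2 and 6 digits separated by hyphens) is inferred from the examples shown in this notebook.

```python
import re

# Assumed layout: .../data/<CIK>/<ADSH without hyphens>/<ADSH>-index.htm
FILING_DETAIL_URL = re.compile(
    r'https://www\.sec\.gov/Archives/edgar/data/'
    r'(?P<cik>\d+)/(?P<folder>\d{18})/(?P<adsh>\d{10}-\d{2}-\d{6})-index\.htm'
)

def split_filing_detail_url(url):
    '''Return (cik, adsh) for a Filing Detail URL, or raise ValueError.'''
    m = FILING_DETAIL_URL.fullmatch(url)
    if m is None:
        raise ValueError('not a Filing Detail URL: ' + url)
    # The folder name is expected to be the ADSH with its hyphens removed.
    if m.group('adsh').replace('-', '') != m.group('folder'):
        raise ValueError('folder and ADSH do not agree: ' + url)
    return m.group('cik'), m.group('adsh')

example_url = ('https://www.sec.gov/Archives/edgar/data/'
               '18498/000001849818000048/0000018498-18-000048-index.htm')
cik, adsh = split_filing_detail_url(example_url)
print(cik, adsh)
```

Building the URL is the reverse operation: concatenate the CIK, the ADSH without hyphens, and the ADSH followed by `-index.htm`.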

Due to the large volume of filings, we select only the annual 10-K filings rather than the quarterly 10-Qs.

Summarized:
1. Get the list of ADSH filing codes.
2. Access the URL that yields the Filing Detail page.
3. Use BeautifulSoup to identify the URLs to EXHIBIT 31.1 and 31.2.
4. Use Regex and/or text matching to identify the officers.
5. Use Regex and/or text matching to identify their titles.
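
The link-discovery step (finding the exhibit URLs with BeautifulSoup) can be tried offline as sketched below. The HTML snippet, its file names and the CIK 1234 are made up for illustration, and the pattern is a close variant of the one used in the download cell, applied to the `href` only. The helper name `find_exhibit_31_links` is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment imitating the document table of a Filing Detail page.
SAMPLE_INDEX_HTML = """
<table>
  <tr><td>10-K</td><td><a href="/Archives/edgar/data/1234/000000123418000001/main10k.htm">main10k.htm</a></td></tr>
  <tr><td>EX-31.1</td><td><a href="/Archives/edgar/data/1234/000000123418000001/exhibit311.htm">exhibit311.htm</a></td></tr>
  <tr><td>EX-31.2</td><td><a href="/Archives/edgar/data/1234/000000123418000001/exhibit312.htm">exhibit312.htm</a></td></tr>
  <tr><td>EX-32.1</td><td><a href="/Archives/edgar/data/1234/000000123418000001/exhibit321.htm">exhibit321.htm</a></td></tr>
</table>
"""

# 'ex', up to five more letters, '31', anything, then '.htm' or '.html' at the end.
EXHIBIT_31 = re.compile(r'ex[a-z]{0,5}31.*?\.html?$')

def find_exhibit_31_links(index_html, domain='https://www.sec.gov'):
    '''Return absolute URLs of all links whose href looks like an Exhibit 31 file.'''
    soup = BeautifulSoup(index_html, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        if EXHIBIT_31.search(a['href']):
            # The hrefs are relative to the domain, so prepend it.
            links.append(domain + a['href'])
    return links

exhibit_links = find_exhibit_31_links(SAMPLE_INDEX_HTML)
for link in exhibit_links:
    print(link)
```

Matching on the file name is a heuristic: it relies on filers putting something like `ex31` or `exhibit31` in the name, so exhibits named differently would be missed.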

This will yield:  

| CIK      | Quarter | CEO          | CFO       |
| :------- | :-----: | -----------: | --------: |
| 00002354 | 2015Q1  | Jim Jones    | Tim Bucks |
| 00002354 | 2015Q2  | Jim Jones    | Tim Bucks |
| 00002354 | 2015Q3  | Jane Jackson | Tim Bucks |



In [1]:
# to work with data
import pandas as pd

# to work with regex
import re

# to download
import requests

# to work with HTML tags
from bs4 import BeautifulSoup

# to time functions
import datetime

# to use NaN
import numpy as np

# to pause
import time
import random

# to work with local files
import os

In [54]:
# restore the pickled dictionary of dataframes
try: 
    DataFrames
except NameError:
    DataFrames = pd.read_pickle('dict_of_dfs_num_pre_sub_tag.p')

In [4]:
# select the adsh and cik codes to build the URL
filings_to_obtain = DataFrames['SUB'][DataFrames['SUB']['form'] == '10-K'].copy()
filings_to_obtain = filings_to_obtain[['adsh', 'cik']].copy()

In [27]:
filings_to_obtain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48269 entries, 2 to 0
Data columns (total 3 columns):
adsh    48269 non-null object
cik     48269 non-null object
URL     48269 non-null object
dtypes: object(3)
memory usage: 1.5+ MB


In [28]:
# Build URLs to access filings based on CIK and ADSH-identifiers.

def build_filing_detail_url(row):
    '''
    Returns url of format
    https://www.sec.gov/Archives/edgar/data/18498/000001849818000048/0000018498-18-000048-index.htm
    based on adsh and cik codes.
    '''
    adsh = str(row['adsh'])
    cik = str(row['cik'])
    adsh_stripped = adsh.replace('-','')
    
    url = 'https://www.sec.gov/Archives/edgar/data/' + cik + '/' + adsh_stripped + '/' + adsh + '-index.htm'
    return url

filings_to_obtain['URL'] = filings_to_obtain.apply(build_filing_detail_url, axis=1)

filings_to_obtain.head(3)

Unnamed: 0,adsh,cik,URL
2,0000002969-18-000044,2969,https://www.sec.gov/Archives/edgar/data/2969/0...
4,0000003545-18-000108,3545,https://www.sec.gov/Archives/edgar/data/3545/0...
6,0000004127-18-000046,4127,https://www.sec.gov/Archives/edgar/data/4127/0...


In [50]:
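# Clean text of HTML codes and stray tags.
def clean(text):
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    return BeautifulSoup(text, 'html.parser').get_text()

def get_filings(URL):
    '''
    Downloads the Filing Detail page at URL and returns the concatenated
    text of all Exhibit 31 documents linked from it, or NaN if none is found.
    '''
    # Polite web scraping: pause between 1/3 and 1 second before each filing.
    # Note: SEC.gov also asks automated clients to send a descriptive User-Agent header.
    time.sleep(random.randint(1,3)/3)
    
    r = requests.get(URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    # Store downloaded exhibits in a string.
    all_exh = ''
    
    # Check all links on the filing page
    for a in soup.find_all('a', href=True):
        # find Exhibits like 'exhibit31.htm', 'fooexh31bar.html'
        if re.search(r'ex[a-z]{0,5}31.*?\.html?', str(a)):
            # The hrefs are relative to the domain: add domain.
            exhibit_url = 'https://www.sec.gov'+a['href']
            # a.string can be None (e.g. a link with several child nodes);
            # get_text() always returns a str.
            exhibit_title = a.get_text()
            
            r_ex = requests.get(exhibit_url)
            exh_text = clean(r_ex.text)
            
            all_exh += exhibit_title
            all_exh += exh_text
    
    # Progress counter, initialised by the download loop below.
    global counter
    counter += 1

    if all_exh:
        return all_exh
    else:
        return np.nan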

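# Download in batches and save each batch to CSV, so that an interrupted
# run does not lose the filings already fetched.
os.makedirs('./filings_csvs/', exist_ok=True)

if 'filings_0_10.csv' not in os.listdir('./filings_csvs/'):
    filings_count = len(filings_to_obtain)
    counter = 0

    # The first batch holds 10 filings; each later batch holds 500.
    a = 0
    b = 10
    while a < filings_count:
        df = filings_to_obtain[a:b].copy()
        df['Filings'] = df['URL'].apply(get_filings)
        df.to_csv('./filings_csvs/filings_'+str(a)+'_'+str(b)+'.csv')
        print('processed:', counter, 'from to:',a,b)
        a = b
        b += 500
else: 
    print('Filings already downloaded')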

Filings already downloaded


In [37]:
df = pd.read_csv('./filings_csvs/filings_0_10.csv', index_col=0)
df.head(3)

Unnamed: 0,adsh,cik,URL,Filings
2,0000002969-18-000044,2969,https://www.sec.gov/Archives/edgar/data/2969/0...,apd-exhibit311x30sep20.htm\nEX-31.1\n8\napd-ex...
4,0000003545-18-000108,3545,https://www.sec.gov/Archives/edgar/data/3545/0...,exhibit311q42018.htm\nEX-31.1\n8\nexhibit311q4...
6,0000004127-18-000046,4127,https://www.sec.gov/Archives/edgar/data/4127/0...,fy1810k92818ex311.htm\nEX-31.1\n5\nfy1810k9281...


In [45]:
# Extract the officer name and title with Regex.

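def extract_names_titles(exh):
    # Group 1: the name between 'I,' and 'certify that:'.
    # Group 2: the first title that follows, either 'Chief ... Officer' or a bare
    # 'President' (not 'Vice President', not 'President and Chief ...').
    regex = r'I,.(.+?),?.certify.that:.+?(Chief.+?Officer|(?<!Vice )President(?! and Chief))'
    exh = clean(str(exh))
    # re.DOTALL lets '.' span line breaks between the name and the title.
    match = re.findall(regex, exh, re.DOTALL)
    if match:
        return match
    else:
        return np.nan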

try: companies_names
except NameError:
    list_of_df = [pd.DataFrame(columns=['adsh', 'cik', 'URL'])] 

    for datafile in os.listdir('./filings_csvs/'):
        df = pd.read_csv('./filings_csvs/'+datafile, index_col=0)
        df['Officers'] = df['Filings'].apply(extract_names_titles)
        df.drop(['Filings'], axis=1, inplace=True)
        list_of_df.append(df)
        print('processed', datafile)

    companies_names = pd.concat(list_of_df, sort=False)
companies_names.head(3)

Unnamed: 0,Officers,URL,adsh,cik
2,"[(Seifi Ghasemi, Chief Executive Officer), (M....",https://www.sec.gov/Archives/edgar/data/2969/0...,0000002969-18-000044,2969
4,"[(Henry R. Slack, Chief Financial Officer)]",https://www.sec.gov/Archives/edgar/data/3545/0...,0000003545-18-000108,3545
6,"[(Liam K. Griffin, Chief Executive Officer), (...",https://www.sec.gov/Archives/edgar/data/4127/0...,0000004127-18-000046,4127


In [52]:
# The names and titles are embedded in a list of tuples. 
# This cell structures them into two proper columns.

def return_name_given_title(officer_list, desired_titles):
    '''
    Returns the first officer name whose title is in desired_titles,
    truncated at the first comma. Returns NaN if officer_list is not
    a list or if no title matches.
    '''
    if not isinstance(officer_list, list):
        return np.nan
    for name, title in officer_list:
        if title in desired_titles:
            # Keep only the part of the name before the first comma.
            return str(name).split(',')[0]
    return np.nan
        
companies_names['CEO'] = companies_names['Officers'].apply(return_name_given_title, args=(
                                                                     ['Chief Executive Officer', 'President'], ))
companies_names['CFO'] = companies_names['Officers'].apply(return_name_given_title, args=(
                                                                     ['Chief Financial Officer'], ))

companies_names.head(3)

Unnamed: 0,Officers,URL,adsh,cik,CEO,CFO
2,"[(Seifi Ghasemi, Chief Executive Officer), (M....",https://www.sec.gov/Archives/edgar/data/2969/0...,0000002969-18-000044,2969,Seifi Ghasemi,M. Scott Crocco
4,"[(Henry R. Slack, Chief Financial Officer)]",https://www.sec.gov/Archives/edgar/data/3545/0...,0000003545-18-000108,3545,,Henry R. Slack
6,"[(Liam K. Griffin, Chief Executive Officer), (...",https://www.sec.gov/Archives/edgar/data/4127/0...,0000004127-18-000046,4127,Liam K. Griffin,Kris Sennesael


In [68]:
# Combine the results with the submission information ('SUB') dataframe.

sub = DataFrames['SUB']
CompInfo = sub.loc[sub['form'] == '10-K',
                   ['adsh', 'cik', 'name', 'stprba', 'period', 'fy', 'fp', 'filed']].copy()
Company_CEO_CFO = pd.merge(
    CompInfo, 
    companies_names[['adsh', 'CEO', 'CFO']], 
    on='adsh', 
    how='left')

In [71]:
CEO_found = sum(Company_CEO_CFO['CEO'].notnull()) / len(Company_CEO_CFO['adsh'])
CFO_found = sum(Company_CEO_CFO['CFO'].notnull()) / len(Company_CEO_CFO['adsh'])

print('CEOs found:',CEO_found,'CFOs found:',CFO_found)

Company_CEO_CFO.info()

CEOs found: 0.7822826244587624 CFOs found: 0.6375106175806419
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48269 entries, 0 to 48268
Data columns (total 10 columns):
adsh      48269 non-null object
cik       48269 non-null object
name      48269 non-null object
stprba    45434 non-null object
period    48269 non-null object
fy        48269 non-null object
fp        48269 non-null object
filed     48269 non-null object
CEO       37760 non-null object
CFO       30772 non-null object
dtypes: object(10)
memory usage: 4.1+ MB


In [72]:
Company_CEO_CFO.head(3)

Unnamed: 0,adsh,cik,name,stprba,period,fy,fp,filed,CEO,CFO
0,0000002969-18-000044,2969,AIR PRODUCTS & CHEMICALS INC /DE/,PA,20180930,2018,FY,20181120,Seifi Ghasemi,M. Scott Crocco
1,0000003545-18-000108,3545,ALICO INC,FL,20180930,2018,FY,20181206,,Henry R. Slack
2,0000004127-18-000046,4127,"SKYWORKS SOLUTIONS, INC.",MA,20180930,2018,FY,20181115,Liam K. Griffin,Kris Sennesael


In [73]:
Company_CEO_CFO[Company_CEO_CFO.name == 'ADVANCED MICRO DEVICES INC']

Unnamed: 0,adsh,cik,name,stprba,period,fy,fp,filed,CEO,CFO
1933,0000002488-18-000042,2488,ADVANCED MICRO DEVICES INC,CA,20171231,2017,FY,20180227,Lisa T. Su,Devinder Kumar
7640,0000002488-17-000043,2488,ADVANCED MICRO DEVICES INC,CA,20161231,2016,FY,20170221,Lisa T. Su,Devinder Kumar
13849,0000002488-16-000111,2488,ADVANCED MICRO DEVICES INC,CA,20151231,2015,FY,20160218,Lisa T. Su,Devinder Kumar
22674,0001193125-15-054362,2488,ADVANCED MICRO DEVICES INC,CA,20141231,2014,FY,20150219,Lisa T. Su,Devinder Kumar
29734,0001193125-14-057240,2488,ADVANCED MICRO DEVICES INC,CA,20131231,2013,FY,20140218,Rory P. Read,Devinder Kumar
36966,0001193125-13-069422,2488,ADVANCED MICRO DEVICES INC,CA,20121231,2012,FY,20130221,Rory P. Read,Devinder Kumar
43784,0001193125-12-075837,2488,ADVANCED MICRO DEVICES INC,CA,20111231,2011,FY,20120224,Rory P. Read,Thomas J. Seifert
47351,0001193125-11-040392,2488,ADVANCED MICRO DEVICES INC,CA,20101231,2010,FY,20110218,,Thomas J. Seifert


In [74]:
if 'Company_CEO_CFO.csv' not in os.listdir('.'):
    Company_CEO_CFO.to_csv('Company_CEO_CFO.csv')

## Result

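We now have a CSV file, `Company_CEO_CFO.csv`, with the CEO and CFO identified for each 10-K filing, along with its filing date. A CEO was found for about 78% of the filings and a CFO for about 64%.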