# Feature Generation - Part 2

In this part of the FEATURE GENERATION module, we will refine the officer extraction logic.

In [166]:
# to work with data
import pandas as pd

# make the display wider
pd.set_option('display.max_colwidth', 100)

# to work with regex
import re

# to work with HTML tags
from bs4 import BeautifulSoup

# to time functions
import datetime

# to use NaN
import numpy as np

In [6]:
# Load the first chunk of 50 from disk

listings_iter = pd.read_csv('filings.csv', chunksize=50)
listings_first50 = next(listings_iter)
listings_first50.head(50)

Unnamed: 0.1,Unnamed: 0,adsh,cik,name,former,changed,fye,form,period,content
0,0,0000002178-18-000067,2178,"ADAMS RESOURCES & ENERGY, INC.",ADAMS RESOURCES & ENERGY INC,19920703.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000002178-18-000067.txt : 20181...
1,1,0000002488-18-000189,2488,ADVANCED MICRO DEVICES INC,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000002488-18-000189.txt : 20181...
2,3,0000003499-18-000023,3499,ALEXANDERS INC,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000003499-18-000023.txt : 20181...
3,5,0000003570-18-000160,3570,CHENIERE ENERGY INC,BEXY COMMUNICATIONS INC,19940314.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000003570-18-000160.txt : 20181...
4,7,0000004281-18-000127,4281,ARCONIC INC.,ALCOA INC.,20141003.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004281-18-000127.txt : 20181...
5,8,0000004457-18-000054,4457,AMERCO /NV/,AMERCO,19770926.0,331.0,10-Q,20180930,<SEC-DOCUMENT>0000004457-18-000054.txt : 20181...
6,9,0000004904-18-000055,4904,AMERICAN ELECTRIC POWER CO INC,KINGSPORT UTILITIES INC,19660906.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004904-18-000055.txt : 20181...
7,10,0000004962-18-000121,4962,AMERICAN EXPRESS CO,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004962-18-000121.txt : 20181...
8,11,0000004969-18-000024,4969,AMERICAN EXPRESS CREDIT CORP,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004969-18-000024.txt : 20181...
9,12,0000004977-18-000152,4977,AFLAC INC,AMERICAN FAMILY CORP,19920306.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004977-18-000152.txt : 20181...


In [88]:
def regex_officers(filing_text, return_as_string=False):
    # capture the name between "I, " and ", certify that"
    expression = r'I, (.+?), certify that'
    match = re.findall(expression, filing_text)
    if return_as_string:
        return str(match)
    return match

officers = pd.DataFrame()

officers['adsh'] = listings_first50['adsh'].copy()
# Store the list as a string for some initial text manipulation functions
officers['first_attempt_at_extraction_as_str'] = listings_first50['content'].apply(
    regex_officers, return_as_string=True)
officers['first_attempt_at_extraction'] = listings_first50['content'].apply(regex_officers)

officers.head(50)

Unnamed: 0,adsh,first_attempt_at_extraction_as_str,first_attempt_at_extraction
0,0000002178-18-000067,"['Townes G. Pressler', '</font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;"">Tracy E. Ohmart</font><font style=""ba...","[Townes G. Pressler, </font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;"">Tracy E. Ohmart</font><font style=""backg..."
1,0000002488-18-000189,"['Lisa T. Su', 'Devinder Kumar']","[Lisa T. Su, Devinder Kumar]"
2,0000003499-18-000023,"['Steven Roth', 'Matthew Iocco']","[Steven Roth, Matthew Iocco]"
3,0000003570-18-000160,"['</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-family:inherit;font-size:10pt;"">', 'Michael J. Wortley']","[</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-family:inherit;font-size:10pt;"">, Michael J. Wortley]"
4,0000004281-18-000127,"['Charles P. Blankenship', 'Ken Giacobbe']","[Charles P. Blankenship, Ken Giacobbe]"
5,0000004457-18-000054,"['Edward J. Shoen', 'Jason A. Berg']","[Edward J. Shoen, Jason A. Berg]"
6,0000004904-18-000055,"['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Brian X. Tierney', 'Brian ...","[Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Brian X. Tierney, Brian X. Tierney, Brian X..."
7,0000004962-18-000121,"['Stephen J. Squeri', 'Jeffrey C. Campbell']","[Stephen J. Squeri, Jeffrey C. Campbell]"
8,0000004969-18-000024,"['David L. Yowan', 'Anderson Y. Lee']","[David L. Yowan, Anderson Y. Lee]"
9,0000004977-18-000152,"['Daniel P. Amos', 'Frederick J. Crawford']","[Daniel P. Amos, Frederick J. Crawford]"


## Issues in this first attempt

The regex ...

1. returns HTML style tags (3 of 50 filings)
2. returns HTML character codes such as `&#233;` (2 of 50 filings)
3. comes back empty (1 of 50 filings)
4. returns duplicate names (5 of 50 filings)
5. returns the officer's title appended to the name (4 of 50 filings)

Let's explore these in more detail.

In [74]:
# count entries with HTML style tags
sum(officers['first_attempt_at_extraction_as_str'].str.contains('font style'))

3

In [75]:
# count entries with HTML Codes
sum(officers['first_attempt_at_extraction_as_str'].str.contains('&#'))

2

In [76]:
# count empty entries
sum(officers['first_attempt_at_extraction_as_str'].str.contains(r'\[\]'))

1

In [78]:
# show entries longer than 60 characters to surface tags, duplicates, and appended titles
officers[officers['first_attempt_at_extraction_as_str'].str.len() > 60]

Unnamed: 0,adsh,first_attempt_at_extraction_as_str
0,0000002178-18-000067,"['Townes G. Pressler', '</font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;"">Tracy E. Ohmart</font><font style=""ba..."
3,0000003570-18-000160,"['</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-family:inherit;font-size:10pt;"">', 'Michael J. Wortley']"
6,0000004904-18-000055,"['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Brian X. Tierney', 'Brian ..."
12,0000006201-18-000039,"['W. Douglas Parker', 'Derek J. Kerr', 'W. Douglas Parker', 'Derek J. Kerr']"
16,0000007084-18-000039,"['J. R. Luciano', 'R. G. Young', 'J. R. Luciano, Chairman, Chief Executive Officer, and President of the Company', 'R. G. Young, Executive Vice President and Chief Financial Officer of the Company']"
19,0000008063-18-000029,"['Peter J. Gundermann, President and Chief Executive Officer', 'David C. Burney, Chief Financial Officer']"
27,0000010254-18-000080,"['Frank A. Lodzinski', 'Tony Oviedo', 'Frank A. Lodzinski', 'Tony Oviedo']"
29,0000011544-18-000089,"['W. Robert Berkley, Jr., President and Chief Executive Officer of W. R. Berkley Corporation (the &#8220;registrant&#8221;)', 'Richard M. Baio, Senior Vice President - Chief Financial Officer and ..."
32,0000012927-18-000065,"['</font><font style=""font-family:Arial;font-size:10pt;"">Dennis A. Muilenburg</font><font style=""font-family:Arial;font-size:10pt;"">', '</font><font style=""font-family:Arial;font-size:10pt;"">Grego..."
38,0000014930-18-000148,"['Mark D. Schwabero', 'William L. Metzger', 'Mark D. Schwabero, Chief Executive Officer of Brunswick Corporation', 'William L. Metzger, Chief Financial Officer of Brunswick Corporation']"


In [85]:
# 1. To strip the HTML tags, use BeautifulSoup's get_text() method, as demonstrated:

print('before BS4: ')
print(officers['first_attempt_at_extraction_as_str'][0])

print('')
print('after BS4: ')
print(BeautifulSoup(officers['first_attempt_at_extraction_as_str'][0]).get_text())

before BS4: 
['Townes G. Pressler', '</font><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;">Tracy E. Ohmart</font><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;">']

after BS4: 
['Townes G. Pressler', 'Tracy E. Ohmart']


In [87]:
# 2. To interpret the HTML codes, also use BeautifulSoup's get_text() method, as demonstrated:

print('before BS4: ')
print(officers['first_attempt_at_extraction_as_str'][39])

print('')
print('after BS4: ')
print(BeautifulSoup(officers['first_attempt_at_extraction_as_str'][39]).get_text())

before BS4: 
['Jos&#233; R. Mas', 'George L. Pita']

after BS4: 
['José R. Mas', 'George L. Pita']
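For fragments that contain only character codes and no tags, the standard library's `html.unescape` may be a lighter alternative to BeautifulSoup. This is a swapped-in technique rather than what the notebook uses; the sample strings are copied from the output above.

```python
import html

# Sample strings copied from the extraction output above
raw_names = ['Jos&#233; R. Mas', 'George L. Pita']

# html.unescape converts numeric and named character references to text
decoded_names = [html.unescape(name) for name in raw_names]
print(decoded_names)
```

Note that `html.unescape` leaves tags such as `<font>` untouched, so it only covers issue 2, not issue 1.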


3. The empty entry occurs because the name is split across two lines, and `.` does not match a newline by default:

`style="font-size:9.0pt;line-height:12.0pt;">I, Brian
Duperreault, certify that:</font></font></p>`

To cover this case, allow an optional newline inside the captured name:
`I, (.+?[\n]?.+?), certify that`
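A minimal sketch comparing the two patterns on the excerpt above. The variable names and the whitespace clean-up at the end are illustrative assumptions, not part of the notebook.

```python
import re

# Excerpt from the filing where the name is split across two lines
snippet = ('style="font-size:9.0pt;line-height:12.0pt;">I, Brian\n'
           'Duperreault, certify that:</font></font></p>')

single_line = r'I, (.+?), certify that'
allow_newline = r'I, (.+?[\n]?.+?), certify that'

# '.' does not match '\n' by default, so the original pattern finds nothing
print(re.findall(single_line, snippet))

# The updated pattern tolerates one line break inside the name
matches = re.findall(allow_newline, snippet)

# Collapse the line break into a single space for a clean name
names = [' '.join(m.split()) for m in matches]
print(names)
```

The captured group still contains the raw newline, which is why the last step normalizes whitespace before the name is used.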

In [92]:
# 4. To remove duplicates, use list(set(...)); note that set() does not preserve order:

dupl_removed = list(set(officers['first_attempt_at_extraction'][6]))

print('before: ')
print(officers['first_attempt_at_extraction'][6])
print('')
print('after: ', dupl_removed)

before: 
['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney']

after:  ['Brian X. Tierney', 'Nicholas K. Akins']
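As the output above shows, `set()` does not keep the original order of the names. If order matters (for example, to keep the first signer first), `dict.fromkeys` is one order-preserving alternative on Python 3.7+. This is a sketch of that alternative, not what the notebook uses, and the shortened list is for illustration only.

```python
# Shortened version of the duplicated list shown above
names = ['Nicholas K. Akins', 'Nicholas K. Akins',
         'Brian X. Tierney', 'Brian X. Tierney']

# dict keys are unique and keep insertion order, so the first
# occurrence of each name determines its position
unique_in_order = list(dict.fromkeys(names))
print(unique_in_order)
```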


In [116]:
# 5. Extract officer title, if any

print(officers['first_attempt_at_extraction'][16][2])

expression_for_titles = r'(.+?),.+(Chief .+ Officer)'

match = re.findall(
    expression_for_titles, officers['first_attempt_at_extraction'][16][2])

print('')
print('With new regex: ')
for el in match:
    for i in el:
        print(i)

J. R. Luciano, Chairman, Chief Executive Officer, and President of the Company

With new regex: 
J. R. Luciano
Chief Executive Officer
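To see how the title pattern behaves on different inputs, the sketch below applies it to three entries taken from the first-attempt output: two with a title and one without. The helper `split_name_and_title` is a hypothetical name introduced only for this illustration.

```python
import re

expression_for_titles = r'(.+?),.+(Chief .+ Officer)'

def split_name_and_title(entry):
    """Return a (name, title) tuple if a 'Chief ... Officer' title is found, else the entry unchanged."""
    found = re.findall(expression_for_titles, entry)
    return found[0] if found else entry

samples = [
    'J. R. Luciano, Chairman, Chief Executive Officer, and President of the Company',
    'David C. Burney, Chief Financial Officer',
    'Lisa T. Su',  # no title: the pattern needs a comma and a "Chief ... Officer" phrase
]

results = [split_name_and_title(s) for s in samples]
for r in results:
    print(r)
```

The pattern only recognizes titles of the form "Chief ... Officer", so other titles in the same string (such as "Chairman" or "President") are dropped.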


In [167]:
# Putting it all together:

def regex_officers_corrected(filing_text, return_as_string=False):
    '''
    Updated function to be applied to the filings,
    taking into account lessons 1. through 5. above.
    Returns a list of names and/or (name, title) tuples,
    or np.nan if nothing is found.
    '''
    # find the certification statement(s)
    # note: the newline fix (3.) could not be implemented efficiently; left out for now
    certify_statement = r'I, (.+?), certify that'
    matches = re.findall(certify_statement, filing_text)

    # process results as described in 1. to 5.
    expression_for_titles = r'(.+?),.+(Chief .+ Officer)'
    processed_matches = []
    for match in matches:
        # strip HTML tags and codes (1., 2.)
        # no parser is named here, so bs4 picks the best one installed
        match_souped = BeautifulSoup(match).get_text()
        # extract officer title, if any (5.)
        match_with_title = re.findall(expression_for_titles, match_souped)

        if match_with_title:
            processed_matches.extend(match_with_title)
        elif match_souped:
            processed_matches.append(match_souped)

    # remove duplicates (4.)
    processed_matches_uniques = list(set(processed_matches))

    # drop plain names (no title) if (name, title) tuples are present
    if any(isinstance(i, tuple) for i in processed_matches_uniques):
        processed_matches_uniques = [
            i for i in processed_matches_uniques if isinstance(i, tuple)]

    if return_as_string:
        return str(processed_matches_uniques)
    if not processed_matches_uniques:
        return np.nan
    return processed_matches_uniques

start = datetime.datetime.now()
officers['second_attempt_at_extraction'] = listings_first50['content'].apply(regex_officers_corrected)
end = datetime.datetime.now()
print('for 50 entries: ', end-start)
print('for 5000 entries: ', (end-start)*100)

officers['second_attempt_at_extraction'].head(50)

for 50 entries:  0:00:16.379115
for 5000 entries:  0:27:17.911500


0                                                                              [Townes G. Pressler]
1                                                                      [Devinder Kumar, Lisa T. Su]
2                                                                      [Steven Roth, Matthew Iocco]
3                                                                              [Michael J. Wortley]
4                                                            [Charles P. Blankenship, Ken Giacobbe]
5                                                                  [Jason A. Berg, Edward J. Shoen]
6                                                             [Brian X. Tierney, Nicholas K. Akins]
7                                                          [Stephen J. Squeri, Jeffrey C. Campbell]
8                                                                 [David L. Yowan, Anderson Y. Lee]
9                                                           [Frederick J. Crawford, Daniel P. Amos]


In [168]:
officers.head(50)

Unnamed: 0,adsh,first_attempt_at_extraction_as_str,first_attempt_at_extraction,second_attempt_at_extraction
0,0000002178-18-000067,"['Townes G. Pressler', '</font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000...","[Townes G. Pressler, </font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;fo...",[Townes G. Pressler]
1,0000002488-18-000189,"['Lisa T. Su', 'Devinder Kumar']","[Lisa T. Su, Devinder Kumar]","[Devinder Kumar, Lisa T. Su]"
2,0000003499-18-000023,"['Steven Roth', 'Matthew Iocco']","[Steven Roth, Matthew Iocco]","[Steven Roth, Matthew Iocco]"
3,0000003570-18-000160,"['</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font...","[</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-...",[Michael J. Wortley]
4,0000004281-18-000127,"['Charles P. Blankenship', 'Ken Giacobbe']","[Charles P. Blankenship, Ken Giacobbe]","[Charles P. Blankenship, Ken Giacobbe]"
5,0000004457-18-000054,"['Edward J. Shoen', 'Jason A. Berg']","[Edward J. Shoen, Jason A. Berg]","[Jason A. Berg, Edward J. Shoen]"
6,0000004904-18-000055,"['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K...","[Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, ...","[Brian X. Tierney, Nicholas K. Akins]"
7,0000004962-18-000121,"['Stephen J. Squeri', 'Jeffrey C. Campbell']","[Stephen J. Squeri, Jeffrey C. Campbell]","[Stephen J. Squeri, Jeffrey C. Campbell]"
8,0000004969-18-000024,"['David L. Yowan', 'Anderson Y. Lee']","[David L. Yowan, Anderson Y. Lee]","[David L. Yowan, Anderson Y. Lee]"
9,0000004977-18-000152,"['Daniel P. Amos', 'Frederick J. Crawford']","[Daniel P. Amos, Frederick J. Crawford]","[Frederick J. Crawford, Daniel P. Amos]"


## Review:

The new function correctly yields 48 of 50 officer pairs. If an officer signs the declaration with their title, the name and title are returned as a tuple (name, title).

Two entries come back as empty lists and are replaced by np.nan because BeautifulSoup fails to extract the name from among the HTML tags. A related symptom is visible in rows 0 and 3 above, where the name that was wrapped in `<font>` tags in the first attempt (Tracy E. Ohmart, Jack A. Fusco) is missing from the second attempt. Nothing about the markup looked unusual and a web search turned up no explanation, so we accept this 4% loss and move on.

Next, we will write a second function that extracts the title for names returned without one.

In [156]:
# draft regex for the next step: capture the title that follows the certification statement
# certify that:.+?(Chief .+? Officer)
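
A rough sketch of how the draft pattern above might be applied, assuming the title appears somewhere after "certify that:" in the same document. The sample text, the placeholder name, and the use of `re.DOTALL` are assumptions for illustration; real filings contain far more markup and may need extra clean-up first.

```python
import re

# Hypothetical, heavily simplified certification text with a placeholder name
sample_text = (
    'I, Jane Doe, certify that:\n'
    '1. I have reviewed this quarterly report on Form 10-Q;\n'
    '/s/ Jane Doe\n'
    'Chief Executive Officer\n'
)

# re.DOTALL lets '.' cross line breaks between the statement and the title;
# the lazy quantifiers stop at the first "Chief ... Officer" phrase
title_after_certify = re.compile(r'certify that:.+?(Chief .+? Officer)', re.DOTALL)

titles = title_after_certify.findall(sample_text)
print(titles)
```

Because the search is lazy, it returns the first title after each certification statement, which may not belong to the certifying officer if other titles are mentioned in between.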


