# Feature Generation - Part 2

In this part of the FEATURE GENERATION module, we will refine the officer extraction logic.

In [1]:
# to work with data
import pandas as pd

# make the display wider
pd.set_option('display.max_colwidth', 100)

# to work with regex
import re

# to work with HTML tags
from bs4 import BeautifulSoup

# to time functions
import datetime

# to use NaN
import numpy as np

In [2]:
# Load the first chunk of 50 from disk

listings_iter = pd.read_csv('filings.csv', chunksize=50)
listings_first50 = next(listings_iter)
listings_first50.head(50)

Unnamed: 0.1,Unnamed: 0,adsh,cik,name,former,changed,fye,form,period,content
0,0,0000002178-18-000067,2178,"ADAMS RESOURCES & ENERGY, INC.",ADAMS RESOURCES & ENERGY INC,19920703.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000002178-18-000067.txt : 20181107\n<SEC-HEADER>0000002178-18-000067.hdr.sgml : 2...
1,1,0000002488-18-000189,2488,ADVANCED MICRO DEVICES INC,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000002488-18-000189.txt : 20181031\n<SEC-HEADER>0000002488-18-000189.hdr.sgml : 2...
2,3,0000003499-18-000023,3499,ALEXANDERS INC,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000003499-18-000023.txt : 20181029\n<SEC-HEADER>0000003499-18-000023.hdr.sgml : 2...
3,5,0000003570-18-000160,3570,CHENIERE ENERGY INC,BEXY COMMUNICATIONS INC,19940314.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000003570-18-000160.txt : 20181108\n<SEC-HEADER>0000003570-18-000160.hdr.sgml : 2...
4,7,0000004281-18-000127,4281,ARCONIC INC.,ALCOA INC.,20141003.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004281-18-000127.txt : 20181101\n<SEC-HEADER>0000004281-18-000127.hdr.sgml : 2...
5,8,0000004457-18-000054,4457,AMERCO /NV/,AMERCO,19770926.0,331.0,10-Q,20180930,<SEC-DOCUMENT>0000004457-18-000054.txt : 20181107\n<SEC-HEADER>0000004457-18-000054.hdr.sgml : 2...
6,9,0000004904-18-000055,4904,AMERICAN ELECTRIC POWER CO INC,KINGSPORT UTILITIES INC,19660906.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004904-18-000055.txt : 20181025\n<SEC-HEADER>0000004904-18-000055.hdr.sgml : 2...
7,10,0000004962-18-000121,4962,AMERICAN EXPRESS CO,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004962-18-000121.txt : 20181023\n<SEC-HEADER>0000004962-18-000121.hdr.sgml : 2...
8,11,0000004969-18-000024,4969,AMERICAN EXPRESS CREDIT CORP,,,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004969-18-000024.txt : 20181102\n<SEC-HEADER>0000004969-18-000024.hdr.sgml : 2...
9,12,0000004977-18-000152,4977,AFLAC INC,AMERICAN FAMILY CORP,19920306.0,1231.0,10-Q,20180930,<SEC-DOCUMENT>0000004977-18-000152.txt : 20181101\n<SEC-HEADER>0000004977-18-000152.hdr.sgml : 2...


In [4]:
def regex_officers(filing_text, return_as_string=False):
    expression = r'I, (.+?), certify that'
    match = re.findall(expression, filing_text)
    if return_as_string:
        return str(match)
    else:
        return match

officers = pd.DataFrame()

officers['adsh'] = listings_first50['adsh'].copy()
# Store the list as a string for some initial text manipulation functions
officers['first_attempt_at_extraction_as_str'] = listings_first50['content'].apply(
    regex_officers, return_as_string=True)
officers['first_attempt_at_extraction'] = listings_first50['content'].apply(regex_officers)

officers.head(50)

Unnamed: 0,adsh,first_attempt_at_extraction_as_str,first_attempt_at_extraction
0,0000002178-18-000067,"['Townes G. Pressler', '</font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000...","[Townes G. Pressler, </font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;fo..."
1,0000002488-18-000189,"['Lisa T. Su', 'Devinder Kumar']","[Lisa T. Su, Devinder Kumar]"
2,0000003499-18-000023,"['Steven Roth', 'Matthew Iocco']","[Steven Roth, Matthew Iocco]"
3,0000003570-18-000160,"['</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font...","[</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-..."
4,0000004281-18-000127,"['Charles P. Blankenship', 'Ken Giacobbe']","[Charles P. Blankenship, Ken Giacobbe]"
5,0000004457-18-000054,"['Edward J. Shoen', 'Jason A. Berg']","[Edward J. Shoen, Jason A. Berg]"
6,0000004904-18-000055,"['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K...","[Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, ..."
7,0000004962-18-000121,"['Stephen J. Squeri', 'Jeffrey C. Campbell']","[Stephen J. Squeri, Jeffrey C. Campbell]"
8,0000004969-18-000024,"['David L. Yowan', 'Anderson Y. Lee']","[David L. Yowan, Anderson Y. Lee]"
9,0000004977-18-000152,"['Daniel P. Amos', 'Frederick J. Crawford']","[Daniel P. Amos, Frederick J. Crawford]"


## Issues in this first attempt

The regex:

1. returns HTML style tags (3 out of 50 filings)
2. contains HTML character codes such as `&#233;` (2 out of 50 filings)
3. comes back empty (1 out of 50 filings)
4. comes back with duplicate names (5 out of 50 filings)
5. comes back with the officer's title appended (4 out of 50 filings)

Let's explore these in more detail.

In [5]:
# count entries with HTML style tags
sum(officers['first_attempt_at_extraction_as_str'].str.contains('font style'))

3

In [6]:
# count entries with HTML Codes
sum(officers['first_attempt_at_extraction_as_str'].str.contains('&#'))

2

In [7]:
# count empty entries
sum(officers['first_attempt_at_extraction_as_str'].str.contains(r'\[\]'))

1

In [8]:
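# show the longer entries, which cover the remaining problem cases (tags, duplicates, appended titles)
officers[officers['first_attempt_at_extraction_as_str'].str.len() > 60]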

Unnamed: 0,adsh,first_attempt_at_extraction_as_str,first_attempt_at_extraction
0,0000002178-18-000067,"['Townes G. Pressler', '</font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000...","[Townes G. Pressler, </font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;fo..."
3,0000003570-18-000160,"['</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font...","[</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-..."
6,0000004904-18-000055,"['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K...","[Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, ..."
12,0000006201-18-000039,"['W. Douglas Parker', 'Derek J. Kerr', 'W. Douglas Parker', 'Derek J. Kerr']","[W. Douglas Parker, Derek J. Kerr, W. Douglas Parker, Derek J. Kerr]"
16,0000007084-18-000039,"['J. R. Luciano', 'R. G. Young', 'J. R. Luciano, Chairman, Chief Executive Officer, and Presiden...","[J. R. Luciano, R. G. Young, J. R. Luciano, Chairman, Chief Executive Officer, and President of ..."
19,0000008063-18-000029,"['Peter J. Gundermann, President and Chief Executive Officer', 'David C. Burney, Chief Financial...","[Peter J. Gundermann, President and Chief Executive Officer, David C. Burney, Chief Financial Of..."
27,0000010254-18-000080,"['Frank A. Lodzinski', 'Tony Oviedo', 'Frank A. Lodzinski', 'Tony Oviedo']","[Frank A. Lodzinski, Tony Oviedo, Frank A. Lodzinski, Tony Oviedo]"
29,0000011544-18-000089,"['W. Robert Berkley, Jr., President and Chief Executive Officer of W. R. Berkley Corporation (th...","[W. Robert Berkley, Jr., President and Chief Executive Officer of W. R. Berkley Corporation (the..."
32,0000012927-18-000065,"['</font><font style=""font-family:Arial;font-size:10pt;"">Dennis A. Muilenburg</font><font style=...","[</font><font style=""font-family:Arial;font-size:10pt;"">Dennis A. Muilenburg</font><font style=""..."
38,0000014930-18-000148,"['Mark D. Schwabero', 'William L. Metzger', 'Mark D. Schwabero, Chief Executive Officer of Bruns...","[Mark D. Schwabero, William L. Metzger, Mark D. Schwabero, Chief Executive Officer of Brunswick ..."


In [9]:
# 1. To strip the HTML tags, use BeautifulSoup's get_text() method, as demonstrated:

print('before BS4: ')
print(officers['first_attempt_at_extraction_as_str'][0])

print('')
print('after BS4: ')
print(BeautifulSoup(officers['first_attempt_at_extraction_as_str'][0]).get_text())

before BS4: 
['Townes G. Pressler', '</font><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;">Tracy E. Ohmart</font><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:Times New Roman;font-size:10pt;line-height:120%;">']

after BS4: 
['Townes G. Pressler', 'Tracy E. Ohmart']


In [10]:
# 2. To interpret the HTML codes, also use BeautifulSoup's get_text() method, as demonstrated:

print('before BS4: ')
print(officers['first_attempt_at_extraction_as_str'][39])

print('')
print('after BS4: ')
print(BeautifulSoup(officers['first_attempt_at_extraction_as_str'][39]).get_text())

before BS4: 
['Jos&#233; R. Mas', 'George L. Pita']

after BS4: 
['José R. Mas', 'George L. Pita']


3. The empty entry is due to the name being split across 2 lines:

`style="font-size:9.0pt;line-height:12.0pt;">I, Brian
Duperreault, certify that:</font></font></p>`

To cover this case, extend the regex to allow an optional newline inside the name:
`I, (.+?[\n]?.+?), certify that`
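
As a quick check of the pattern above, the following sketch runs both expressions on the excerpt shown earlier, with the line break written as an explicit `\n`. The whitespace clean-up step at the end is an addition for illustration, not part of the original notebook.

```python
import re

# Excerpt from the filing shown above, with the line break made explicit
snippet = ('style="font-size:9.0pt;line-height:12.0pt;">I, Brian\n'
           'Duperreault, certify that:</font></font></p>')

original = r'I, (.+?), certify that'
updated = r'I, (.+?[\n]?.+?), certify that'

# '.' does not match a newline, so the original pattern finds nothing
original_matches = re.findall(original, snippet)
updated_matches = re.findall(updated, snippet)

# collapse the embedded newline (and any repeated whitespace) into single spaces
cleaned = [' '.join(m.split()) for m in updated_matches]

print(original_matches)
print(cleaned)
```

The updated pattern tolerates only a single line break inside the name; names split over more than two lines would still be missed.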

In [11]:
# 4. To remove duplicates, use list(set([])):

dupl_removed = list(set(officers['first_attempt_at_extraction'][6]))

print('before: ')
print(officers['first_attempt_at_extraction'][6])
print('')
print('after: ', dupl_removed)

before: 
['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney', 'Brian X. Tierney']

after:  ['Brian X. Tierney', 'Nicholas K. Akins']
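
Note that a `set` does not keep the original order, which is why 'Brian X. Tierney' now comes first. If the signing order matters later on (for example, when pairing names with titles by position), an order-preserving variant could look like the sketch below. `unique_in_order` is a hypothetical helper and not part of the original notebook.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys keep insertion order (guaranteed from Python 3.7 onwards)
    return list(dict.fromkeys(items))


names = ['Nicholas K. Akins', 'Nicholas K. Akins',
         'Brian X. Tierney', 'Brian X. Tierney']
print(unique_in_order(names))
```

This works for strings and tuples alike, since both are hashable.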


In [12]:
# 5. Extract officer title, if any

print(officers['first_attempt_at_extraction'][16][2])

expression_for_titles = r'(.+?),.+(Chief .+ Officer)'

match = re.findall(
    expression_for_titles, officers['first_attempt_at_extraction'][16][2])

print('')
print('With new regex: ')
for el in match:
    for i in el:
        print(i)

J. R. Luciano, Chairman, Chief Executive Officer, and President of the Company

With new regex: 
J. R. Luciano
Chief Executive Officer


In [13]:
# Putting it all together:

def regex_officers_corrected(filing_text, return_as_string=False):
    '''
    This is the updated function to be applied to the filings,
    taking into consideration lessons 1. through 5. above.
    '''
    # find the certification statement(s)
    # note: the newline fix (3.) could not be implemented efficiently and is left out for now
    certify_statement = r'I, (.+?), certify that'
    matches = re.findall(certify_statement, filing_text)
    
    # process results as described in 1. to 5.
    processed_matches = []
    for match in matches:
        # strips HTML tags and codes (1., 2.)
        match_souped = BeautifulSoup(match).get_text()
        # Extract officer title (5.)
        expression_for_titles = r'(.+?),.+(Chief .+ Officer)'
        match_with_title = re.findall(expression_for_titles, match_souped)
        
        if match_with_title:
            for i in match_with_title:
                processed_matches.append(i)
        elif match_souped:
            processed_matches.append(match_souped)
    
    # remove duplicates (4.); note that set() does not preserve the signing order
    processed_matches_uniques = list(set(processed_matches))
    
    # remove non-tuples (names without titles) if tuples (names with titles) are present
    if any(type(i) is tuple for i in processed_matches_uniques):
        processed_matches_uniques = [
            i for i in processed_matches_uniques if type(i) is tuple]
    
    if return_as_string:
        return str(processed_matches_uniques)
    elif not processed_matches_uniques:
        return np.nan
    else:
        return processed_matches_uniques

start = datetime.datetime.now()
officers['second_attempt_at_extraction'] = listings_first50['content'].apply(regex_officers_corrected)
end = datetime.datetime.now()
print('for 50 entries: ', end-start)
print('for 5000 entries: ', (end-start)*100)

officers['second_attempt_at_extraction'].head(50)

for 50 entries:  0:00:16.624692
for 5000 entries:  0:27:42.469200


0                                                                              [Townes G. Pressler]
1                                                                      [Devinder Kumar, Lisa T. Su]
2                                                                      [Matthew Iocco, Steven Roth]
3                                                                              [Michael J. Wortley]
4                                                            [Ken Giacobbe, Charles P. Blankenship]
5                                                                  [Edward J. Shoen, Jason A. Berg]
6                                                             [Brian X. Tierney, Nicholas K. Akins]
7                                                          [Jeffrey C. Campbell, Stephen J. Squeri]
8                                                                 [David L. Yowan, Anderson Y. Lee]
9                                                           [Daniel P. Amos, Frederick J. Crawford]


In [14]:
officers.head(50)

Unnamed: 0,adsh,first_attempt_at_extraction_as_str,first_attempt_at_extraction,second_attempt_at_extraction
0,0000002178-18-000067,"['Townes G. Pressler', '</font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000...","[Townes G. Pressler, </font><font style=""background-color:rgb(255,255,255, 0.0);color:#000000;fo...",[Townes G. Pressler]
1,0000002488-18-000189,"['Lisa T. Su', 'Devinder Kumar']","[Lisa T. Su, Devinder Kumar]","[Devinder Kumar, Lisa T. Su]"
2,0000003499-18-000023,"['Steven Roth', 'Matthew Iocco']","[Steven Roth, Matthew Iocco]","[Matthew Iocco, Steven Roth]"
3,0000003570-18-000160,"['</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font...","[</font><font style=""font-family:inherit;font-size:10pt;"">Jack A. Fusco</font><font style=""font-...",[Michael J. Wortley]
4,0000004281-18-000127,"['Charles P. Blankenship', 'Ken Giacobbe']","[Charles P. Blankenship, Ken Giacobbe]","[Ken Giacobbe, Charles P. Blankenship]"
5,0000004457-18-000054,"['Edward J. Shoen', 'Jason A. Berg']","[Edward J. Shoen, Jason A. Berg]","[Edward J. Shoen, Jason A. Berg]"
6,0000004904-18-000055,"['Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K. Akins', 'Nicholas K...","[Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, Nicholas K. Akins, ...","[Brian X. Tierney, Nicholas K. Akins]"
7,0000004962-18-000121,"['Stephen J. Squeri', 'Jeffrey C. Campbell']","[Stephen J. Squeri, Jeffrey C. Campbell]","[Jeffrey C. Campbell, Stephen J. Squeri]"
8,0000004969-18-000024,"['David L. Yowan', 'Anderson Y. Lee']","[David L. Yowan, Anderson Y. Lee]","[David L. Yowan, Anderson Y. Lee]"
9,0000004977-18-000152,"['Daniel P. Amos', 'Frederick J. Crawford']","[Daniel P. Amos, Frederick J. Crawford]","[Daniel P. Amos, Frederick J. Crawford]"


## Review:

The new function correctly yields 48 of 50 officer-pairs. If the officer signs the declaration with their title, the title is extracted into a tuple (name, title).  

Two entries return as empty lists, which are replaced by np.nan. In these cases BeautifulSoup appears to fail to extract the name from among the HTML tags. Nothing about the markup looked out of the ordinary, and a Google search did not yield any answers, so we accept this loss (4%) and move on.
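
One thing that may be worth trying, although it has not been tested against the two failing filings, is to name the parser explicitly. Without a second argument, BeautifulSoup picks whichever parser is installed, and different parsers can treat malformed fragments differently. The sketch below uses a simplified, made-up fragment shaped like the regex matches above (it begins with a stray closing tag) together with Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Simplified, illustrative fragment: starts with a stray closing tag and
# ends with an unclosed opening tag, like the matches shown earlier
fragment = ('</font><font style="font-size:10pt;">Jack A. Fusco</font>'
            '<font style="font-size:10pt;">')

# Naming the parser makes the result independent of what else is installed
name = BeautifulSoup(fragment, 'html.parser').get_text(strip=True)
print(name)
```

Whether this recovers the two missing names would need to be checked on the actual filing text.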

Next, we will write a second function that extracts the title for the names without title.

## Title extraction with Regex

The approach mirrors the name extraction: write a naive function, review the results, then build a more robust solution that addresses the deficiencies.

In [32]:
# Clean up the working file

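try:
    # keep only the columns needed and give the names column a clearer label
    officers = officers[['adsh', 'second_attempt_at_extraction']].rename(
        columns={'second_attempt_at_extraction': 'officer_names'})
except KeyError:
    # the column was already renamed on an earlier run of this cell
    pass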
    

officers

Unnamed: 0,adsh,officer_names,titles_firsttry
0,0000002178-18-000067,[Townes G. Pressler],[]
1,0000002488-18-000189,"[Devinder Kumar, Lisa T. Su]","[Chief Executive Officer, Chief Financial Officer]"
2,0000003499-18-000023,"[Matthew Iocco, Steven Roth]","[Chief Executive Officer, Chief Financial Officer]"
3,0000003570-18-000160,[Michael J. Wortley],"[Chief Executive Officer, Chief Financial Officer]"
4,0000004281-18-000127,"[Ken Giacobbe, Charles P. Blankenship]","[Chief Executive Officer, Chief Financial Officer]"
5,0000004457-18-000054,"[Edward J. Shoen, Jason A. Berg]",[]
6,0000004904-18-000055,"[Brian X. Tierney, Nicholas K. Akins]","[Chief Executive Officer, Chief Executive Officer]"
7,0000004962-18-000121,"[Jeffrey C. Campbell, Stephen J. Squeri]",[]
8,0000004969-18-000024,"[David L. Yowan, Anderson Y. Lee]",[]
9,0000004977-18-000152,"[Daniel P. Amos, Frederick J. Crawford]","[Chief Executive Officer, Chief Financial Officer]"


In [147]:
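# Pattern prototyped in a regex tester (global and multiline flags):
# certify that:.+?(Chief .+? Officer)

def regex_titles_fast(filing_text, return_as_string=False):
    titles_expression=r'certify that:.+?(Chief .+? Officer)'
    # note: re.MULTILINE only changes how ^ and $ behave; '.' still stops at a
    # newline, so the title must sit on the same line as 'certify that:'
    matches = re.findall(titles_expression, filing_text, re.MULTILINE)
    # only the first two matches are relevant
    matches = matches[0:2]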
    
    # clean up with bs4
    match_souped = []
    for i in matches:
        i = BeautifulSoup(i).get_text(strip=True)
        match_souped.append(i)
    
    if match_souped == []:
        return np.nan
    else:
        return match_souped


officers['titles_firsttry'] = listings_first50['content'].apply(regex_titles_fast)
officers.head(50)

Unnamed: 0,adsh,officer_names,titles_firsttry,titles_2nd_try
0,0000002178-18-000067,[Townes G. Pressler],,[Chief Financial Officer]
1,0000002488-18-000189,"[Devinder Kumar, Lisa T. Su]","[Chief Executive Officer, Chief Financial Officer]",
2,0000003499-18-000023,"[Matthew Iocco, Steven Roth]","[Chief Executive Officer, Chief Financial Officer]",
3,0000003570-18-000160,[Michael J. Wortley],"[Chief Executive Officer, Chief Financial Officer]",
4,0000004281-18-000127,"[Ken Giacobbe, Charles P. Blankenship]","[Chief Executive Officer, Chief Financial Officer]",
5,0000004457-18-000054,"[Edward J. Shoen, Jason A. Berg]",,"[President, Chief Financial Officer]"
6,0000004904-18-000055,"[Brian X. Tierney, Nicholas K. Akins]","[Chief Executive Officer, Chief Executive Officer]",
7,0000004962-18-000121,"[Jeffrey C. Campbell, Stephen J. Squeri]",,"[Chief Executive Officer, Chief Financial Officer]"
8,0000004969-18-000024,"[David L. Yowan, Anderson Y. Lee]",,"[Chief Executive Officer, Chief Financial Officer]"
9,0000004977-18-000152,"[Daniel P. Amos, Frederick J. Crawford]","[Chief Executive Officer, Chief Financial Officer]",


## Problems with the results

1. Some results come back empty. A check of the .txt file indicates that HTML codes are present, which disrupt the regex.
2. Some executives hold the title 'President', which the pattern does not capture.
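
A further possible contributor to the empty results (an assumption, not something verified against the filings) is the choice of regex flag. `re.MULTILINE` only affects the anchors `^` and `$`, whereas `re.DOTALL` is the flag that lets `.` match a newline. The sketch below illustrates the difference on a hypothetical, heavily simplified certification text.

```python
import re

# Hypothetical sample text: the title appears a few lines after 'certify that:'
sample = (
    'I, Jane Doe, certify that:\n'
    '1. I have reviewed this quarterly report;\n'
    'Jane Doe\n'
    'Chief Executive Officer\n'
)

pattern = r'certify that:.+?(Chief .+? Officer)'

# '.' cannot cross the newline after 'certify that:', so nothing is found
with_multiline = re.findall(pattern, sample, re.MULTILINE)

# with DOTALL, '.' also matches newlines and the title is reached
with_dotall = re.findall(pattern, sample, re.DOTALL)

print(with_multiline)
print(with_dotall)
```

On full filings, `re.DOTALL` lets the lazy `.+?` run across arbitrarily many lines, so it may be slower and may pick up a title far from the certification; it is a trade-off rather than a drop-in fix.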

In [141]:
def regex_titles_corrected(filing_text):
    # Soup the entire filing. Strip any extraneous characters 
    # (this significantly improves results, but is slow)
    souped_filing = BeautifulSoup(filing_text).get_text(strip=True)
    
    # Include 'President' in the results
    titles_expression=r'certify that:.+?(Chief .+? Officer|President)'
    matches = re.findall(titles_expression, souped_filing, re.MULTILINE)
    # only the first two matches are relevant
    matches = matches[0:2]
    if matches == []:
        return np.nan
    else:
        return matches

# the above function is effective but slow. Use only on NaN-results of first round.
officers_filter = pd.isnull(officers['titles_firsttry'])

officers['titles_2nd_try'] = listings_first50['content'][officers_filter].apply(regex_titles_corrected)
officers.head(50)

['Chief Financial Officer']
['President', 'Chief Financial Officer']
['Chief Executive Officer', 'Chief Financial Officer']
['Chief Executive Officer', 'Chief Financial Officer']
[]
[]
['President', 'Chief Financial Officer']
['Chief Executive Officer', 'Chief Financial Officer']
['Chief Executive Officer', 'Chief Financial Officer']
[]


Unnamed: 0,adsh,officer_names,titles_firsttry,titles_2nd_try
0,0000002178-18-000067,[Townes G. Pressler],,[Chief Financial Officer]
1,0000002488-18-000189,"[Devinder Kumar, Lisa T. Su]","[Chief Executive Officer, Chief Financial Officer]",
2,0000003499-18-000023,"[Matthew Iocco, Steven Roth]","[Chief Executive Officer, Chief Financial Officer]",
3,0000003570-18-000160,[Michael J. Wortley],"[Chief Executive Officer, Chief Financial Officer]",
4,0000004281-18-000127,"[Ken Giacobbe, Charles P. Blankenship]","[Chief Executive Officer, Chief Financial Officer]",
5,0000004457-18-000054,"[Edward J. Shoen, Jason A. Berg]",,"[President, Chief Financial Officer]"
6,0000004904-18-000055,"[Brian X. Tierney, Nicholas K. Akins]","[Chief Executive Officer, Chief Executive Officer]",
7,0000004962-18-000121,"[Jeffrey C. Campbell, Stephen J. Squeri]",,"[Chief Executive Officer, Chief Financial Officer]"
8,0000004969-18-000024,"[David L. Yowan, Anderson Y. Lee]",,"[Chief Executive Officer, Chief Financial Officer]"
9,0000004977-18-000152,"[Daniel P. Amos, Frederick J. Crawford]","[Chief Executive Officer, Chief Financial Officer]",


In [142]:
sum(pd.isnull(officers['titles_firsttry']) & pd.isnull(officers['titles_2nd_try']))

3

## Results and application to all filings

Now, only 3 out of 50 yield no results with either function. There are other minor issues:  
* 'ChiefFinancial Officer' appears once (missing space).
* Fewer titles than names are found.

However, an interesting heuristic appears: in the overwhelming majority of cases, the CEO/President signed first. Only one filing (out of 47) reversed this.
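
Regarding the missing space in 'ChiefFinancial Officer' noted above: one plausible cause (not confirmed on the filing itself) is that `get_text(strip=True)` joins adjacent text nodes without any separator. The sketch below shows this on a made-up fragment and how the `separator` argument of `get_text` changes the result.

```python
from bs4 import BeautifulSoup

# Made-up fragment in which the title is split across two elements
html = '<p>Chief</p><p>Financial Officer</p>'
soup = BeautifulSoup(html, 'html.parser')

# stripped text nodes are joined with an empty string by default
joined = soup.get_text(strip=True)

# a one-space separator keeps the words apart
spaced = soup.get_text(separator=' ', strip=True)

print(joined)
print(spaced)
```

If the separator is changed in the title function, the regex results should be re-checked, since the extra spaces alter the text the pattern runs on.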

Now, we will define the master function that will be applied to the full filings dataframe.

In [151]:
def extract_names_titles(filing_text):
    '''
    Reads the 10-Q .txt data and returns
    [(Name1, Title1), (Name2, Title2)]
    '''
    valid_titles = ['Chief Executive Officer', 'Chief Financial Officer', 'President']
    
    names = regex_officers_corrected(filing_text)
    print('starting.. ')
    
    try:
        # if the first attempt already comes back as a list of tuples, we're done.
        if all(type(i) is tuple for i in names):
            return names

        # use the fast function for first try
        titles = regex_titles_fast(filing_text)

        # came back empty? Use the slower, more accurate function
        if titles is np.nan:
            titles = regex_titles_corrected(filing_text)

        # still came back empty? Use HEURISTIC: first is CEO, second is CFO.
        if titles is np.nan:
            titles = ('Chief Executive Officer', 'Chief Financial Officer')

        if all(i in valid_titles for i in titles):
            if len(names) == 2:
                return [(names[0], titles[0]), (names[1], titles[1])]
            if len(names) == 1:
                return [(names[0], titles[0])]
    
    except Exception:
        # e.g. names is NaN (nothing extracted), which is not iterable
        return np.nan

starttime = datetime.datetime.now()    
officers['Final Try'] = listings_first50['content'].apply(extract_names_titles)   
endtime = datetime.datetime.now()
print('processing 50 took: ', endtime-starttime, ' 5000 will take: ', (endtime-starttime)*100)

officers.head(50)

starting.. 
['Chief Financial Officer']
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
['President', 'Chief Financial Officer']
starting.. 
starting.. 
['Chief Executive Officer', 'Chief Financial Officer']
starting.. 
['Chief Executive Officer', 'Chief Financial Officer']
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
[]
starting.. 
starting.. 
starting.. 
starting.. 
['Chief Executive Officer', 'Chief Financial Officer']
starting.. 
starting.. 
starting.. 
starting.. 
['Chief Executive Officer', 'Chief Financial Officer']
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
starting.. 
[]
starting.. 
starting.. 


Unnamed: 0,adsh,officer_names,titles_firsttry,titles_2nd_try,Final Try
0,0000002178-18-000067,[Townes G. Pressler],,[Chief Financial Officer],"[(Townes G. Pressler, Chief Financial Officer)]"
1,0000002488-18-000189,"[Devinder Kumar, Lisa T. Su]","[Chief Executive Officer, Chief Financial Officer]",,"[(Devinder Kumar, Chief Executive Officer), (Lisa T. Su, Chief Financial Officer)]"
2,0000003499-18-000023,"[Matthew Iocco, Steven Roth]","[Chief Executive Officer, Chief Financial Officer]",,"[(Matthew Iocco, Chief Executive Officer), (Steven Roth, Chief Financial Officer)]"
3,0000003570-18-000160,[Michael J. Wortley],"[Chief Executive Officer, Chief Financial Officer]",,"[(Michael J. Wortley, Chief Executive Officer)]"
4,0000004281-18-000127,"[Ken Giacobbe, Charles P. Blankenship]","[Chief Executive Officer, Chief Financial Officer]",,"[(Ken Giacobbe, Chief Executive Officer), (Charles P. Blankenship, Chief Financial Officer)]"
5,0000004457-18-000054,"[Edward J. Shoen, Jason A. Berg]",,"[President, Chief Financial Officer]","[(Edward J. Shoen, President), (Jason A. Berg, Chief Financial Officer)]"
6,0000004904-18-000055,"[Brian X. Tierney, Nicholas K. Akins]","[Chief Executive Officer, Chief Executive Officer]",,"[(Brian X. Tierney, Chief Executive Officer), (Nicholas K. Akins, Chief Executive Officer)]"
7,0000004962-18-000121,"[Jeffrey C. Campbell, Stephen J. Squeri]",,"[Chief Executive Officer, Chief Financial Officer]","[(Jeffrey C. Campbell, Chief Executive Officer), (Stephen J. Squeri, Chief Financial Officer)]"
8,0000004969-18-000024,"[David L. Yowan, Anderson Y. Lee]",,"[Chief Executive Officer, Chief Financial Officer]","[(David L. Yowan, Chief Executive Officer), (Anderson Y. Lee, Chief Financial Officer)]"
9,0000004977-18-000152,"[Daniel P. Amos, Frederick J. Crawford]","[Chief Executive Officer, Chief Financial Officer]",,"[(Daniel P. Amos, Chief Executive Officer), (Frederick J. Crawford, Chief Financial Officer)]"


## Result

The script produces data in the desired format. **Unfortunately, at least one (index 49) is incorrect.** Further review is needed.
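
As a starting point for that review, a small sanity check could flag results that look doubtful. The sketch below is a hypothetical helper, not part of the original notebook, and it rests on the assumption that each filing should yield exactly two officers with distinct titles.

```python
def flag_suspicious(pairs):
    """Return reasons why a [(name, title), ...] result looks doubtful."""
    # NaN, None and empty lists all count as 'no result'
    if not isinstance(pairs, list) or not pairs:
        return ['no result']
    reasons = []
    titles = [title for _, title in pairs]
    if len(set(titles)) < len(titles):
        reasons.append('duplicate title')
    if len(pairs) != 2:
        reasons.append('expected two officers')
    return reasons


# Two rows taken from the table above
row_6 = [('Brian X. Tierney', 'Chief Executive Officer'),
         ('Nicholas K. Akins', 'Chief Executive Officer')]
row_0 = [('Townes G. Pressler', 'Chief Financial Officer')]

print(flag_suspicious(row_6))
print(flag_suspicious(row_0))
```

Applied with `officers['Final Try'].apply(flag_suspicious)`, this would narrow the manual review to the rows that break these assumptions. It cannot detect names paired with the wrong title when both titles are present and distinct.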