# Overview

Based on *IRS 990 e-File Data -- Excise Tax Project (3) -- Schedule J Part (II) -- Combine, Flatten, and Parse Part II.ipynb*

Read in DF from first notebook:
- *Schedule J (Part II).pkl.gz*

Combine columns then save:
- *Schedule J (Part II) - combined two columns.pkl.gz*

Split by person and save DF:
- *Schedule J Part II (PERSON-LEVEL DF).pkl.gz*
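- There are on average 6.02 individuals per Schedule J filing for filings with more than one person (2,672,474 person rows across 444,095 filings)
- Overall, there are about 4.00 individuals per Schedule J filing once the 299,590 single-person filings are included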

*Note:*
- 29,429 of the filings have no data for Part II. Samples of these filings were checked in the original notebook; in this notebook all 29,429 rows are deleted.

# Overview of Form 990 Compensation Variables, Sections, and Schedules

A review of the 990 shows three main locations for compensation data:

1. 990 (main)
    - Part IV # 23 (checkbox for Part VII)
        - "Did the organization answer “Yes” to Part VII, Section A, line 3, 4, or 5 about compensation of the organization’s current and former officers, directors, trustees, key employees, and highest compensated employees? If “Yes,” complete Schedule J"
    - Part VI, Section B, #15 - review process for compensation
        - "Did the process for determining compensation of the following persons include a review and approval by independent persons, comparability data, and contemporaneous substantiation of the deliberation and decision?"
            - (a) The organization’s CEO, Executive Director, or top management official
            - (b) Other officers or key employees of the organization 
    - Part VII (entire section)
    - Part IX, #5
        - "Compensation of current officers, directors, trustees, and key employees"
    - Part IX, #6
        - "Compensation not included above, to disqualified persons (as defined under section 4958(f)(1)) and persons described in section 4958(c)(3)(B)"
2. Schedule O 
    - explanation for compensation review process (see Part VI, Section B, 15)
3. Schedule J
    - compensation details

# Overview
This notebook is where much of the hard work takes place. It took a lot of trial and error to develop the best approach for manipulating the data. As you may recall from prior tutorials, the two key issues with the data were 1) that different years can have different variable names and 2) that many of the columns contain 'nested' data, with multiple variables contained in a single column. We need to fix these issues. At the same time, we want to have a standardized set of variable names, so for renaming the variables we will be using the *concordance* file provided by the e-file team.

The basic strategy I take here is to read in the PANDAS dataframe as well as the concordance file, and then work through the nested columns in turn. Many of these operations took 20 minutes or so on my MacBook Pro with 16GB of RAM, so I would not want to run them on a less powerful machine. Because of the length of time for each step, I saved a different version of the dataset at each stage. This also saves time if a mistake is made in a step and we need to revert to a prior version.
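The save-at-each-stage idea can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual checkpointing code: the file name `stage_1.pkl.gz`, the temporary directory, and the toy values are all made up for the example. It also shows why pickle is a convenient format here: nested lists and dictionaries come back as Python objects rather than as strings.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one intermediate stage (values are illustrative only)
stage = pd.DataFrame({
    'URL': ['u1', 'u2'],
    'Form990ScheduleJPartII': [
        {'PersonNm': 'A'},                       # single person: a dict
        [{'PersonNm': 'B'}, {'PersonNm': 'C'}],  # several people: a list of dicts
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical checkpoint file name
    path = os.path.join(tmp, 'stage_1.pkl.gz')
    stage.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

# The nested objects survive the round trip unchanged
round_trip_ok = (
    restored['URL'].tolist() == stage['URL'].tolist()
    and restored['Form990ScheduleJPartII'].tolist() == stage['Form990ScheduleJPartII'].tolist()
)
print(round_trip_ok)
```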

After all of the 'flattening' and 'combining' and 'renaming', I conducted verifications to make sure we have all of the expected variables.

The final output is a dataset (*all filings - compensation data.pkl*) with 1,338,852 observations (filings) and 36 columns. The columns include 4 filing/organization identifier columns and 32 'compensation' columns from the main 990 form (Schedule J and Schedule O data are treated in separate notebooks). Specifically, based on our earlier review of the 990 form, these 32 columns contain compensation data from the following locations in the main 990 form:

- Part IV # 23 (checkbox for Part VII)
    - "Did the organization answer “Yes” to Part VII, Section A, line 3, 4, or 5 about compensation of the organization’s current and former officers, directors, trustees, key employees, and highest compensated employees? If “Yes,” complete Schedule J"
- Part VI, Section B, #15 - review process for compensation
    - "Did the process for determining compensation of the following persons include a review and approval by independent persons, comparability data, and contemporaneous substantiation of the deliberation and decision?"
        - (a) The organization’s CEO, Executive Director, or top management official
        - (b) Other officers or key employees of the organization
- Part VII (entire section)
- Part IX, #5
    - "Compensation of current officers, directors, trustees, and key employees"
- Part IX, #6
    - "Compensation not included above, to disqualified persons (as defined under section 4958(f)(1)) and persons described in section 4958(c)(3)(B)"
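The renaming idea mentioned in this overview (different years use different variable names) can be sketched as follows. The mapping is a hypothetical stand-in for the concordance file, whose actual layout is not shown here. The old/new pairs are assumed correspondences based on key names that appear in the Schedule J data in this notebook, and `rename_keys` and the sample record are illustrative only.

```python
# Hypothetical concordance: older e-file key -> standardized key.
# These pairs are assumed correspondences, not read from the real concordance file.
concordance = {
    'NamePerson': 'PersonNm',
    'BaseCompensationFilingOrg': 'BaseCompensationFilingOrgAmt',
    'TotalCompensationFilingOrg': 'TotalCompensationFilingOrgAmt',
}

def rename_keys(record, mapping):
    """Return a copy of one person's dict with standardized keys; unmapped keys are kept as-is."""
    return {mapping.get(key, key): value for key, value in record.items()}

# Made-up record in the older naming style
old_style = {'NamePerson': 'JANE DOE', 'BaseCompensationFilingOrg': '100000', 'TitleTxt': 'CEO'}
standardized = rename_keys(old_style, concordance)
print(standardized)
```

Keys that are already in the newer style (here `TitleTxt`) pass through untouched, so the same helper could be applied to records from any year.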

# Load Packages and Set Working Directory

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

2.2.2


In [3]:
from platform import python_version
print(python_version())

3.10.11


In [4]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 2500)

In [5]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

#### Set working directory

In [6]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read PANDAS DF

In [7]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_pickle('Schedule J (Part II).pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

Current date and time :  2025-06-26 23:13:03 

# of columns: 3
# of observations: 773114
CPU times: total: 18.9 s
Wall time: 32.4 s


Unnamed: 0,URL,Form990ScheduleJPartII,RltdOrgOfficerTrstKeyEmplGrp
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,"[{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}, {'NamePerson': 'RONALD W PATTERSON', 'CompBasedOnRelatedOrgs': '192455', 'BonusRelatedOrgs': '814', 'OtherCompensationRelatedOrgs': '2071', 'DeferredCompRelatedOrgs': '17271', 'NontaxableBenefitsRelatedOrgs': '23201', 'TotalCompensationRelatedOrgs': '235812'}, {'NamePerson': 'ROBIN KELLER', 'CompBasedOnRelatedOrgs': '78098', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '755', 'DeferredCompRelatedOrgs': '10371', 'NontaxableBenefitsRelatedOrgs': '78922', 'TotalCompensationRelatedOrgs': '168936'}, {'NamePerson': 'PATRICK SHERIDAN', 'CompBasedOnRelatedOrgs': '172631', 'BonusRelatedOrgs': '836', 'OtherCompensationRelatedOrgs': '1195', 'DeferredCompRelatedOrgs': '33690', 'NontaxableBenefitsRelatedOrgs': '206', 'TotalCompensationRelatedOrgs': '208558'}, {'NamePerson': 'MICHAEL W KING', 'CompBasedOnRelatedOrgs': '153811', 'BonusRelatedOrgs': '133', 'OtherCompensationRelatedOrgs': '875', 'DeferredCompRelatedOrgs': '12395', 'NontaxableBenefitsRelatedOrgs': '31037', 'TotalCompensationRelatedOrgs': '198251'}, {'NamePerson': 'DAVID T BOWMAN', 'CompBasedOnRelatedOrgs': '170752', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1347', 'NontaxableBenefitsRelatedOrgs': '58951', 'TotalCompensationRelatedOrgs': '231840'}, {'NamePerson': 'CHARLES W GOULD', 'CompBasedOnRelatedOrgs': '200426', 'BonusRelatedOrgs': '656', 'OtherCompensationRelatedOrgs': '99161', 'DeferredCompRelatedOrgs': '19907', 'NontaxableBenefitsRelatedOrgs': '9047', 'TotalCompensationRelatedOrgs': '329197'}]",


In [8]:
df[-1:]

Unnamed: 0,URL,Form990ScheduleJPartII,RltdOrgOfficerTrstKeyEmplGrp
773113,https://s3.amazonaws.com/irs-form-990/202441449349301564_public.xml,,"{'BusinessName': {'BusinessNameLine1Txt': 'Larry L Parker'}, 'TitleTxt': 'CEO', 'BaseCompensationFilingOrgAmt': '126782', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '42528', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '169310', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}"


In [9]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
print(len(df[df['URL'].isnull()]))
print(len(df[(df['Form990ScheduleJPartII'].isnull())]))
print(len(df[(df['RltdOrgOfficerTrstKeyEmplGrp'].isnull())]))

Current date and time :  2025-06-26 23:14:37 

0
652224
150319
CPU times: total: 266 ms
Wall time: 1.12 s


# Combine columns

### Define Function to combine columns
In Python we can create a series of functions that can be used as shortcuts. First we'll create a function called 'combine_dict' that will combine two variables that contain nested data. It takes as *inputs* four things: our dataset/dataframe (*df*), the name we'd like for our new variable (*newvar*), the name of the first variable to combine (*var1*), and the name of the second variable to combine (*var2*).

In [10]:
def combine_dict(df, newvar, var1, var2):
    """Create df[newvar] from var1 where present, otherwise from var2 (modifies df in place)."""
    # Prefer var1; fall back to var2 when var1 is missing
    df[newvar] = np.where(df[var1].notnull(), df[var1], df[var2])
    #print(df[newvar].value_counts()[:5], '\n')
    print('# of missing observations:', len(df[df[newvar].isnull()]))
    print('# of valid observations:', len(df[df[newvar].notnull()]), '\n')
    #return df[[newvar, var1, var2, 'DLN']][:5]
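
To see what this combining pattern does, here is a minimal sketch on a three-row toy frame. The column names `old_col` and `new_col` and the values are made up for illustration. The `np.where` line repeats the pattern used in `combine_dict`, and the overlap count is the kind of safety check that makes the combination lossless.

```python
import numpy as np
import pandas as pd

# Toy frame: each row has data in at most one of the two columns
toy = pd.DataFrame({
    'old_col': [[{'NamePerson': 'A'}, {'NamePerson': 'B'}], None, None],
    'new_col': [None, {'PersonNm': 'C'}, None],
})

# Safety check: no row should have data in both columns
n_both = int((toy['old_col'].notnull() & toy['new_col'].notnull()).sum())

# Same pattern as combine_dict: take old_col where present, else new_col
toy['combined'] = np.where(toy['old_col'].notnull(), toy['old_col'], toy['new_col'])

n_missing = int(toy['combined'].isnull().sum())
n_valid = int(toy['combined'].notnull().sum())
print(n_both, n_missing, n_valid)
```

If `n_both` were greater than zero, the `np.where` call would silently discard the second column's value for those rows, which is why the overlap is worth checking first.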

In [11]:
print(len(df[df['Form990ScheduleJPartII'].isnull()]))
print(len(df[df['Form990ScheduleJPartII'].notnull()]))

652224
120890


In [12]:
print(len(df[df['RltdOrgOfficerTrstKeyEmplGrp'].isnull()]))
print(len(df[df['RltdOrgOfficerTrstKeyEmplGrp'].notnull()]))

150319
622795


In [13]:
print(len(df[(df['Form990ScheduleJPartII'].isnull()) & (df['RltdOrgOfficerTrstKeyEmplGrp'].isnull())]))

29429


<br>The next check confirms that no rows have both variables, so we're safe combining them.

In [14]:
print(len(df[(df['Form990ScheduleJPartII'].notnull()) & (df['RltdOrgOfficerTrstKeyEmplGrp'].notnull())]))

0


In [15]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
combine_dict(df, 'Form990ScheduleJPartII', 'Form990ScheduleJPartII', 'RltdOrgOfficerTrstKeyEmplGrp')

Current date and time :  2025-06-26 23:16:28 

# of missing observations: 29429
# of valid observations: 743685 

CPU times: total: 422 ms
Wall time: 897 ms


In [17]:
df.sample(5)[['Form990ScheduleJPartII', 'RltdOrgOfficerTrstKeyEmplGrp']]

Unnamed: 0,Form990ScheduleJPartII,RltdOrgOfficerTrstKeyEmplGrp
242422,"{'PersonNm': 'JACKIE BARGER', 'TitleTxt': 'EXECUTIVE DIRECTOR', 'BaseCompensationFilingOrgAmt': '67459', 'NontaxableBenefitsFilingOrgAmt': '7599', 'TotalCompensationFilingOrgAmt': '75058'}","{'PersonNm': 'JACKIE BARGER', 'TitleTxt': 'EXECUTIVE DIRECTOR', 'BaseCompensationFilingOrgAmt': '67459', 'NontaxableBenefitsFilingOrgAmt': '7599', 'TotalCompensationFilingOrgAmt': '75058'}"
19606,"[{'NamePerson': 'Henderschedt Robert R', 'BaseCompensationFilingOrg': '573822', 'CompBasedOnRelatedOrgs': '0', 'BonusFilingOrg': '166156', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '221069', 'OtherCompensationRelatedOrgs': '0', 'DeferredCompFilingOrg': '103853', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '27815', 'NontaxableBenefitsRelatedOrgs': '0', 'TotalCompensationFilingOrg': '1092715', 'TotalCompensationRelatedOrgs': '0', 'CompReportPrior990FilingOrg': '0', 'CompReportPrior990RelatedOrgs': '0'}, {'NamePerson': 'Houmann Lars', 'BaseCompensationFilingOrg': '805019', 'CompBasedOnRelatedOrgs': '0', 'BonusFilingOrg': '232797', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '1670573', 'OtherCompensationRelatedOrgs': '3091', 'DeferredCompFilingOrg': '171790', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '42086', 'NontaxableBenefitsRelatedOrgs': '0', 'TotalCompensationFilingOrg': '2922265', 'TotalCompensationRelatedOrgs': '3091', 'CompReportPrior990FilingOrg': '0', 'CompReportPrior990RelatedOrgs': '0'}, {'NamePerson': 'Jernigan PhD Donald L', 'BaseCompensationFilingOrg': '903442', 'CompBasedOnRelatedOrgs': '0', 'BonusFilingOrg': '261600', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '501349', 'OtherCompensationRelatedOrgs': '0', 'DeferredCompFilingOrg': '186905', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '100232', 'NontaxableBenefitsRelatedOrgs': '0', 'TotalCompensationFilingOrg': '1953528', 'TotalCompensationRelatedOrgs': '0', 'CompReportPrior990FilingOrg': '186046', 'CompReportPrior990RelatedOrgs': '0'}, {'NamePerson': 'Johnson Sandra K', 'BaseCompensationFilingOrg': '370185', 'CompBasedOnRelatedOrgs': '0', 'BonusFilingOrg': '89695', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '537495', 'OtherCompensationRelatedOrgs': '0', 'DeferredCompFilingOrg': '43106', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '38234', 'NontaxableBenefitsRelatedOrgs': 
'0', 'TotalCompensationFilingOrg': '1078715', 'TotalCompensationRelatedOrgs': '0', 'CompReportPrior990FilingOrg': '0', 'CompReportPrior990RelatedOrgs': '0'}, {'NamePerson': 'Reed MD Monica P', 'BaseCompensationFilingOrg': '394740', 'CompBasedOnRelatedOrgs': '0', 'BonusFilingOrg': '92617', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '15471', 'OtherCompensationRelatedOrgs': '1434', 'DeferredCompFilingOrg': '68429', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '48530', 'Non...",
327639,"[{'PersonNm': 'Alexa Sewell', 'TitleTxt': 'Chairman', 'CompensationBasedOnRltdOrgsAmt': '245665', 'NontaxableBenefitsRltdOrgsAmt': '39311', 'TotalCompensationRltdOrgsAmt': '284976'}, {'PersonNm': 'Lee Warshavsky', 'TitleTxt': 'Treasurer', 'CompensationBasedOnRltdOrgsAmt': '142564', 'NontaxableBenefitsRltdOrgsAmt': '9948', 'TotalCompensationRltdOrgsAmt': '152512'}]","[{'PersonNm': 'Alexa Sewell', 'TitleTxt': 'Chairman', 'CompensationBasedOnRltdOrgsAmt': '245665', 'NontaxableBenefitsRltdOrgsAmt': '39311', 'TotalCompensationRltdOrgsAmt': '284976'}, {'PersonNm': 'Lee Warshavsky', 'TitleTxt': 'Treasurer', 'CompensationBasedOnRltdOrgsAmt': '142564', 'NontaxableBenefitsRltdOrgsAmt': '9948', 'TotalCompensationRltdOrgsAmt': '152512'}]"
581163,"{'PersonNm': 'Milan Yager', 'BaseCompensationFilingOrgAmt': '244848', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '19650', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '7755', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '28880', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '301133', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}","{'PersonNm': 'Milan Yager', 'BaseCompensationFilingOrgAmt': '244848', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '19650', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '7755', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '28880', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '301133', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}"
481122,"[{'PersonNm': 'ROBIN SOMERS', 'TitleTxt': 'PRESIDENT/CEO AS OF FEB 2020, COO', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '223310', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '7086', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '10636', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '6891', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '35520', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensationRltdOrgsAmt': '283443', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}, {'PersonNm': 'JOHN J PALKOVITZ JR', 'TitleTxt': 'TREASURER/CFO', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '230853', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '7086', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '12391', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '8016', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '14046', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensationRltdOrgsAmt': '272392', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}, {'PersonNm': 'JOHN HOWL', 'TitleTxt': 'PRESIDENT/CEO TO FEB 2020', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '264523', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '20000', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '54370', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '11428', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '14142', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensationRltdOrgsAmt': '364463', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}]","[{'PersonNm': 'ROBIN SOMERS', 'TitleTxt': 'PRESIDENT/CEO AS OF FEB 2020, COO', 'BaseCompensationFilingOrgAmt': '0', 
'CompensationBasedOnRltdOrgsAmt': '223310', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '7086', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '10636', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '6891', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '35520', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensationRltdOrgsAmt': '283443', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}, {'PersonNm': 'JOHN J PALKOVITZ JR', 'TitleTxt': 'TREASURER/CFO', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '230853', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '7086', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '12391', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '8016', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '14046', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensationRltdOrgsAmt': '272392', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}, {'PersonNm': 'JOHN HOWL', 'TitleTxt': 'PRESIDENT/CEO TO FEB 2020', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '264523', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '20000', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '54370', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '11428', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '14142', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensationRltdOrgsAmt': '364463', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}]"


In [18]:
print(len(df[df['Form990ScheduleJPartII'].isnull()]))
print(len(df[df['Form990ScheduleJPartII'].notnull()]))

29429
743685


<br>Note: In the original notebook I checked samples of the rows with no Part II data. 

##### Delete empty rows

<br>
743,685 is the expected number of rows with data in either 'RltdOrgOfficerTrstKeyEmplGrp' or 'Form990ScheduleJPartII' (120,890 + 622,795, with no overlap).

In [19]:
print(len(df[df['Form990ScheduleJPartII'].isnull()]))
print(len(df[df['Form990ScheduleJPartII'].notnull()]))

29429
743685


In [20]:
print(len(df))
print(len(df[df['Form990ScheduleJPartII'].isnull()]))
print(len(df[df['Form990ScheduleJPartII'].notnull()]))
df = df[df['Form990ScheduleJPartII'].notnull()]
print(len(df))
print(len(df[df['Form990ScheduleJPartII'].isnull()]))
print(len(df[df['Form990ScheduleJPartII'].notnull()]))
df[:1]

773114
29429
743685
743685
0
743685


Unnamed: 0,URL,Form990ScheduleJPartII,RltdOrgOfficerTrstKeyEmplGrp
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,"[{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}, {'NamePerson': 'RONALD W PATTERSON', 'CompBasedOnRelatedOrgs': '192455', 'BonusRelatedOrgs': '814', 'OtherCompensationRelatedOrgs': '2071', 'DeferredCompRelatedOrgs': '17271', 'NontaxableBenefitsRelatedOrgs': '23201', 'TotalCompensationRelatedOrgs': '235812'}, {'NamePerson': 'ROBIN KELLER', 'CompBasedOnRelatedOrgs': '78098', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '755', 'DeferredCompRelatedOrgs': '10371', 'NontaxableBenefitsRelatedOrgs': '78922', 'TotalCompensationRelatedOrgs': '168936'}, {'NamePerson': 'PATRICK SHERIDAN', 'CompBasedOnRelatedOrgs': '172631', 'BonusRelatedOrgs': '836', 'OtherCompensationRelatedOrgs': '1195', 'DeferredCompRelatedOrgs': '33690', 'NontaxableBenefitsRelatedOrgs': '206', 'TotalCompensationRelatedOrgs': '208558'}, {'NamePerson': 'MICHAEL W KING', 'CompBasedOnRelatedOrgs': '153811', 'BonusRelatedOrgs': '133', 'OtherCompensationRelatedOrgs': '875', 'DeferredCompRelatedOrgs': '12395', 'NontaxableBenefitsRelatedOrgs': '31037', 'TotalCompensationRelatedOrgs': '198251'}, {'NamePerson': 'DAVID T BOWMAN', 'CompBasedOnRelatedOrgs': '170752', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1347', 'NontaxableBenefitsRelatedOrgs': '58951', 'TotalCompensationRelatedOrgs': '231840'}, {'NamePerson': 'CHARLES W GOULD', 'CompBasedOnRelatedOrgs': '200426', 'BonusRelatedOrgs': '656', 'OtherCompensationRelatedOrgs': '99161', 'DeferredCompRelatedOrgs': '19907', 'NontaxableBenefitsRelatedOrgs': '9047', 'TotalCompensationRelatedOrgs': '329197'}]",
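
The row counts printed in the cells above can be cross-checked with a little arithmetic. The numbers below are copied from those outputs; the variable names are just labels chosen for this sketch.

```python
# Counts copied from the outputs above
n_total = 773114      # rows read from the pickle
n_old_col = 120890    # rows with Form990ScheduleJPartII before combining
n_new_col = 622795    # rows with RltdOrgOfficerTrstKeyEmplGrp
n_both = 0            # rows with both columns
n_neither = 29429     # rows with neither column

# Rows with data in at least one column (inclusion-exclusion)
n_with_data = n_old_col + n_new_col - n_both

# Rows left after dropping the empty ones
n_remaining = n_total - n_neither
print(n_with_data, n_remaining)
```

Both routes give the same figure, which matches the number of rows kept after the empty rows are deleted.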


### Drop Extra Columns

In [21]:
df = df.drop('RltdOrgOfficerTrstKeyEmplGrp', axis=1)
df[:1]

Unnamed: 0,URL,Form990ScheduleJPartII
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,"[{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}, {'NamePerson': 'RONALD W PATTERSON', 'CompBasedOnRelatedOrgs': '192455', 'BonusRelatedOrgs': '814', 'OtherCompensationRelatedOrgs': '2071', 'DeferredCompRelatedOrgs': '17271', 'NontaxableBenefitsRelatedOrgs': '23201', 'TotalCompensationRelatedOrgs': '235812'}, {'NamePerson': 'ROBIN KELLER', 'CompBasedOnRelatedOrgs': '78098', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '755', 'DeferredCompRelatedOrgs': '10371', 'NontaxableBenefitsRelatedOrgs': '78922', 'TotalCompensationRelatedOrgs': '168936'}, {'NamePerson': 'PATRICK SHERIDAN', 'CompBasedOnRelatedOrgs': '172631', 'BonusRelatedOrgs': '836', 'OtherCompensationRelatedOrgs': '1195', 'DeferredCompRelatedOrgs': '33690', 'NontaxableBenefitsRelatedOrgs': '206', 'TotalCompensationRelatedOrgs': '208558'}, {'NamePerson': 'MICHAEL W KING', 'CompBasedOnRelatedOrgs': '153811', 'BonusRelatedOrgs': '133', 'OtherCompensationRelatedOrgs': '875', 'DeferredCompRelatedOrgs': '12395', 'NontaxableBenefitsRelatedOrgs': '31037', 'TotalCompensationRelatedOrgs': '198251'}, {'NamePerson': 'DAVID T BOWMAN', 'CompBasedOnRelatedOrgs': '170752', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1347', 'NontaxableBenefitsRelatedOrgs': '58951', 'TotalCompensationRelatedOrgs': '231840'}, {'NamePerson': 'CHARLES W GOULD', 'CompBasedOnRelatedOrgs': '200426', 'BonusRelatedOrgs': '656', 'OtherCompensationRelatedOrgs': '99161', 'DeferredCompRelatedOrgs': '19907', 'NontaxableBenefitsRelatedOrgs': '9047', 'TotalCompensationRelatedOrgs': '329197'}]"


#### Save DF

In [22]:
print('# of columns:', len(df.columns))
print('# of observations:', len(df))

# of columns: 2
# of observations: 743685


In [23]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('Schedule J (Part II) - combined two columns.pkl.gz', compression='gzip')

Current date and time :  2025-06-26 23:17:21 

CPU times: total: 1min 20s
Wall time: 1min 25s


#### Quick verifications 

In [24]:
df[df['URL'].isnull()]

Unnamed: 0,URL,Form990ScheduleJPartII


# Split column by person
First, we create a new variable, *part_2_len*, that captures the 'length' of the *Form990ScheduleJPartII* variable. If the value is a *list*, the new value is the number of persons entered in Part II. If the value is a *dictionary*, in contrast, there is only 1 entry, and the code below leaves *part_2_len* missing (NaN) for those rows. This new variable not only differentiates the *list* values from the *dictionary* values, but also helps us verify the data after splitting it into separate rows.
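
As a minimal sketch of the rule described above, the toy frame below has one list-valued row and one dictionary-valued row (the names are made up). The `n_people` column is a hypothetical variant, not used in this notebook, that records single-person filings as 1 instead of leaving them missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Form990ScheduleJPartII': [
        [{'PersonNm': 'A'}, {'PersonNm': 'B'}, {'PersonNm': 'C'}],  # list: three people
        {'PersonNm': 'D'},                                           # dict: one person
    ]
})

# Length for lists, missing (NaN) for dictionaries
toy['part_2_len'] = toy['Form990ScheduleJPartII'].apply(
    lambda x: len(x) if isinstance(x, list) else np.nan
)

# Hypothetical variant: count a dictionary as one person
toy['n_people'] = toy['Form990ScheduleJPartII'].apply(
    lambda x: len(x) if isinstance(x, list) else 1
)
print(toy[['part_2_len', 'n_people']])
```

Keeping the dictionary rows as NaN makes it easy to separate the two groups with `isnull()` and `notnull()`, at the cost of turning the column into floats.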

In [25]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df['part_2_len'] = df['Form990ScheduleJPartII'].apply(lambda x: len(x) if isinstance(x, list) else np.nan)
print(len(df[df['part_2_len'].isnull()]))
print(len(df[df['part_2_len'].notnull()]), '\n')
print(df['part_2_len'].value_counts().head(), '\n')
print(df['part_2_len'].describe().T, '\n')

Current date and time :  2025-06-27 12:23:37 

299590
444095 

part_2_len
2.0    119929
3.0     72511
4.0     47849
5.0     35888
6.0     29409
Name: count, dtype: int64 

count    444095.000000
mean          6.017798
std           5.886455
min           2.000000
25%           2.000000
50%           4.000000
75%           8.000000
max         316.000000
Name: part_2_len, dtype: float64 

CPU times: total: 438 ms
Wall time: 922 ms


In [26]:
print(len(df[df['part_2_len'].isnull()]))
df[df['part_2_len'].isnull()].sample(3)

299590


Unnamed: 0,URL,Form990ScheduleJPartII,part_2_len
418853,https://s3.amazonaws.com/irs-form-990/201943169349304849_public.xml,"{'PersonNm': 'RANDAL WARD', 'TitleTxt': 'FUND ADMINISTRATOR', 'CompensationBasedOnRltdOrgsAmt': '375784', 'DeferredCompRltdOrgsAmt': '27500', 'TotalCompensationRltdOrgsAmt': '403284'}",
328640,https://s3.amazonaws.com/irs-form-990/201801669349301355_public.xml,"{'PersonNm': 'Nora Garcia', 'TitleTxt': 'EXECUTIVE DIRECTOR', 'BaseCompensationFilingOrgAmt': '181946', 'DeferredCompensationFlngOrgAmt': '5067', 'NontaxableBenefitsFilingOrgAmt': '6935', 'TotalCompensationFilingOrgAmt': '193948'}",
217732,https://s3.amazonaws.com/irs-form-990/201620919349300032_public.xml,"{'PersonNm': 'GREGORY MCTAGGART', 'TitleTxt': 'President', 'BaseCompensationFilingOrgAmt': '295378', 'NontaxableBenefitsFilingOrgAmt': '13808', 'TotalCompensationFilingOrgAmt': '309186'}",


In [27]:
print(len(df[df['part_2_len'].isnull()]))
print(len(df[df['part_2_len'].notnull()]))

299590
444095


### Create version of dataset with only multiple people per filing 
This is the one we will split. Then we'll append the single-person dataset to the split file.

In [28]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
print(len(df[df['part_2_len'].isnull()]))
print(len(df[df['part_2_len'].notnull()]), '\n')
df2 = df[df['part_2_len'].notnull()].copy()
print(len(df2[df2['part_2_len'].isnull()]))
print(len(df2[df2['part_2_len'].notnull()]))
print(len(df2))

Current date and time :  2025-06-27 12:24:44 

299590
444095 

0
444095
444095
CPU times: total: 375 ms
Wall time: 388 ms


In [30]:
import numpy as np
from itertools import chain

# flatten a Series of lists into a single flat list
def chainer(s):
    #return list(chain.from_iterable(s.str.split(',')))
    #return list(chain.from_iterable(s.values.tolist())) #ALSO WORKS
    return list(chain.from_iterable(s))

In [31]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
# number of persons per filing (cast to int, since repeat counts must be integers)
lens = df2['part_2_len'].astype(int)
# create new dataframe, repeating each URL once per person and chaining the lists
res = pd.DataFrame({'URL': np.repeat(df2['URL'], lens),
                    'Form990ScheduleJPartII': chainer(df2['Form990ScheduleJPartII'])})
print(len(res))
res[-10:]

Current date and time :  2025-06-27 12:25:42 

2672474
CPU times: total: 1.38 s
Wall time: 1.41 s


Unnamed: 0,URL,Form990ScheduleJPartII
773111,https://s3.amazonaws.com/irs-form-990/202441449349301404_public.xml,"{'PersonNm': 'Thomas A Farrington', 'BaseCompensationFilingOrgAmt': '474600', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '474600', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}"
773111,https://s3.amazonaws.com/irs-form-990/202441449349301404_public.xml,"{'PersonNm': 'Juarez Farrington', 'BaseCompensationFilingOrgAmt': '210010', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '210010', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'BRYAN GOODWIN', 'TitleTxt': 'CEO & PRESIDENT', 'BaseCompensationFilingOrgAmt': '267586', 'DeferredCompensationFlngOrgAmt': '27091', 'NontaxableBenefitsFilingOrgAmt': '32850', 'TotalCompensationFilingOrgAmt': '327527'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'RONALD MILETTA', 'TitleTxt': 'CHIEF MKTG & INNOVTN', 'BaseCompensationFilingOrgAmt': '198799', 'DeferredCompensationFlngOrgAmt': '19926', 'NontaxableBenefitsFilingOrgAmt': '11493', 'TotalCompensationFilingOrgAmt': '230218'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'SUE DESCH', 'TitleTxt': 'CFO', 'BaseCompensationFilingOrgAmt': '180137', 'DeferredCompensationFlngOrgAmt': '18060', 'NontaxableBenefitsFilingOrgAmt': '11427', 'TotalCompensationFilingOrgAmt': '209624'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'DALE LEWIS', 'TitleTxt': 'VP LARGE PROGRAMS', 'BaseCompensationFilingOrgAmt': '177982', 'DeferredCompensationFlngOrgAmt': '18091', 'NontaxableBenefitsFilingOrgAmt': '21325', 'TotalCompensationFilingOrgAmt': '217398'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'SHEILA ARENS OLENE', 'TitleTxt': 'ED RESEARCH & EVAL', 'BaseCompensationFilingOrgAmt': '164866', 'DeferredCompensationFlngOrgAmt': '16654', 'NontaxableBenefitsFilingOrgAmt': '19525', 'TotalCompensationFilingOrgAmt': '201045'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'ELIZABETH WATSON', 'TitleTxt': 'SNR DIR OF SALES', 'BaseCompensationFilingOrgAmt': '161500', 'DeferredCompensationFlngOrgAmt': '16459', 'NontaxableBenefitsFilingOrgAmt': '29901', 'TotalCompensationFilingOrgAmt': '207860'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'CHRISTINA TYDEMAN', 'TitleTxt': 'EXEC. PROGRAM DIR.', 'BaseCompensationFilingOrgAmt': '161116', 'DeferredCompensationFlngOrgAmt': '16128', 'NontaxableBenefitsFilingOrgAmt': '7637', 'TotalCompensationFilingOrgAmt': '184881'}"
773112,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,"{'PersonNm': 'KRIS ROULEAU', 'TitleTxt': 'VP LEARNING SERVICES', 'BaseCompensationFilingOrgAmt': '159957', 'DeferredCompensationFlngOrgAmt': '16338', 'NontaxableBenefitsFilingOrgAmt': '31839', 'TotalCompensationFilingOrgAmt': '208134'}"


#### Save DF

In [32]:
len(res)

2672474

In [33]:
print(res['URL'].nunique())  # number of distinct filings represented in res

444095


In [34]:
%%time
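import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
# Intermediate save of the multi-person rows only. The same file name is reused
# for the final save of persondf below, which overwrites this file.
res.to_pickle('Schedule J Part II (PERSON-LEVEL DF).pkl.gz', compression='gzip')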

Current date and time :  2025-06-27 12:25:51 

CPU times: total: 1min 15s
Wall time: 1min 18s


# Append Files
*df3* holds the filings with only one person in *Form990ScheduleJPartII*; these are the rows where *part_2_len* is null.

In [35]:
# Single-person filings: part_2_len is null for these rows
df3 = df[df['part_2_len'].isnull()].copy()
print(len(df3[df3['part_2_len'].notnull()]))  # sanity check: expect 0
print(len(df3[df3['part_2_len'].isnull()]))
print(len(df3))

0
299590
299590
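
The split above relies on `part_2_len` being null for single-person filings. A minimal sketch of that idea on toy data follows. It assumes (this section does not confirm it) that `part_2_len` was populated only where `Form990ScheduleJPartII` holds a list of person records and left null where it holds a single dict. The toy URLs and records are invented.

```python
import pandas as pd

# Toy stand-in for df; the URLs and records are invented for illustration.
toy = pd.DataFrame({
    "URL": ["a.xml", "b.xml", "c.xml"],
    "Form990ScheduleJPartII": [
        [{"PersonNm": "A"}, {"PersonNm": "B"}],  # multi-person filing (list)
        {"PersonNm": "C"},                        # single-person filing (dict)
        [{"PersonNm": "D"}, {"PersonNm": "E"}, {"PersonNm": "F"}],
    ],
})

# Assumed rule: record a length only for list-valued cells, else leave null.
toy["part_2_len"] = toy["Form990ScheduleJPartII"].apply(
    lambda v: len(v) if isinstance(v, list) else None
)

multi = toy[toy["part_2_len"].notnull()].copy()   # analogue of df2
single = toy[toy["part_2_len"].isnull()].copy()   # analogue of df3

print(len(multi), len(single))
```

Because the two masks are complements of each other, the subsets always partition the frame, which is what the row-count comparison in this section verifies on the real data.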


In [36]:
df3[:1]

Unnamed: 0,URL,Form990ScheduleJPartII,part_2_len
7,https://s3.amazonaws.com/irs-form-990/201113149349301511_public.xml,"{'NamePerson': 'WALLACE DAVIS', 'BaseCompensationFilingOrg': '0', 'CompBasedOnRelatedOrgs': '207852', 'BonusFilingOrg': '0', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '0', 'OtherCompensationRelatedOrgs': '11400', 'DeferredCompFilingOrg': '0', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '0', 'NontaxableBenefitsRelatedOrgs': '40043', 'TotalCompensationFilingOrg': '0', 'TotalCompensationRelatedOrgs': '259295', 'CompReportPrior990FilingOrg': '0', 'CompReportPrior990RelatedOrgs': '0'}",


In [37]:
df3 = df3[['URL', 'Form990ScheduleJPartII']]
df3[:1]

Unnamed: 0,URL,Form990ScheduleJPartII
7,https://s3.amazonaws.com/irs-form-990/201113149349301511_public.xml,"{'NamePerson': 'WALLACE DAVIS', 'BaseCompensationFilingOrg': '0', 'CompBasedOnRelatedOrgs': '207852', 'BonusFilingOrg': '0', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '0', 'OtherCompensationRelatedOrgs': '11400', 'DeferredCompFilingOrg': '0', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '0', 'NontaxableBenefitsRelatedOrgs': '40043', 'TotalCompensationFilingOrg': '0', 'TotalCompensationRelatedOrgs': '259295', 'CompReportPrior990FilingOrg': '0', 'CompReportPrior990RelatedOrgs': '0'}"


In [45]:
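# Optional intermediate save of df3 (disabled).
# The output shown below is left over from an earlier run of this cell.
#%%time
#import datetime
#print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#df3.to_pickle('df3.pkl.gz', compression='gzip')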

Current date and time :  2022-02-23 22:39:23 

Wall time: 6.34 s


In [48]:
print(len(df2), len(df3))
print(len(df2)+len(df3))
print(len(df))

444095 299590
743685
743685


In [39]:
res[:1]

Unnamed: 0,URL,Form990ScheduleJPartII
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,"{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}"


In [40]:
len(res) + len(df3)

2972064

In [41]:
%%time
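import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
# Older form, no longer available in pandas 2.x: persondf = res.append(df3)
# Stack the multi-person rows (res) and single-person rows (df3) with a fresh 0..n-1 index
persondf = pd.concat([res, df3], ignore_index=True)
print(len(persondf))
persondf[-1:]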

Current date and time :  2025-06-27 12:27:44 

2972064
CPU times: total: 109 ms
Wall time: 484 ms


Unnamed: 0,URL,Form990ScheduleJPartII
2972063,https://s3.amazonaws.com/irs-form-990/202441449349301564_public.xml,"{'BusinessName': {'BusinessNameLine1Txt': 'Larry L Parker'}, 'TitleTxt': 'CEO', 'BaseCompensationFilingOrgAmt': '126782', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '42528', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '169310', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}"
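
The commented-out `res.append(df3)` reflects the older API: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so `pd.concat` is the replacement. A small sketch on made-up frames (`left` and `right` are hypothetical stand-ins for `res` and `df3`) shows what `ignore_index=True` does to the row labels:

```python
import pandas as pd

# Invented toy frames standing in for res and df3.
left = pd.DataFrame({"URL": ["a.xml", "a.xml"], "n": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"URL": ["b.xml"], "n": [3]}, index=[7])

kept = pd.concat([left, right])                      # keeps original labels: 0, 1, 7
fresh = pd.concat([left, right], ignore_index=True)  # relabels rows 0..n-1

print(kept.index.tolist())
print(fresh.index.tolist())
```

This is consistent with the last row of `persondf` above carrying the label 2972063, one less than the row count.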


In [42]:
print(len(df))               # total filings
print(df['URL'].nunique())   # unique filing URLs

743685
743685


In [43]:
print(len(persondf))               # total person-level rows
print(persondf['URL'].nunique())   # unique filing URLs

2972064
743685
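
The two counts above show that every filing URL survives the split-and-recombine step. One way to turn that visual comparison into a hard check is an assertion on the URL sets. The sketch below uses invented toy frames; `filings` and `people` are hypothetical stand-ins for `df` and `persondf`.

```python
import pandas as pd

filings = pd.DataFrame({"URL": ["a.xml", "b.xml", "c.xml"]})
people = pd.DataFrame({"URL": ["a.xml", "a.xml", "b.xml", "c.xml", "c.xml"]})

# Filing-level frame: one row per URL.
assert filings["URL"].is_unique

# Person-level frame: same set of URLs, possibly repeated.
assert set(people["URL"]) == set(filings["URL"])
assert people["URL"].nunique() == len(filings)
print("URL coverage check passed")
```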


Number of individuals per filing: first for the subset of filings with multiple individuals, then for all Schedule J filings.

In [49]:
len(res) / len(df2)  # len(df2) is 444095, the number of multi-person filings

6.017797993672525

In [45]:
print(len(persondf)/len(df))

3.996401702333649
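
The same averages can also be derived from the person-level frame itself rather than from separate row counts. A sketch on toy data (the URLs are invented) uses `value_counts` to get the number of person rows per filing. It assumes that "multiple individuals" means more than one row per URL, which may not match exactly how `df2` was defined.

```python
import pandas as pd

# Invented person-level rows: a.xml has 3 people, b.xml has 1, c.xml has 2.
people = pd.DataFrame(
    {"URL": ["a.xml", "a.xml", "a.xml", "b.xml", "c.xml", "c.xml"]}
)

per_filing = people["URL"].value_counts()        # person rows per filing URL

mean_all = per_filing.mean()                     # across all filings
mean_multi = per_filing[per_filing > 1].mean()   # filings with more than one person

print(mean_all, mean_multi)
```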


#### Save DF

In [50]:
persondf.sample(2)

Unnamed: 0,URL,Form990ScheduleJPartII
129223,https://s3.amazonaws.com/irs-form-990/201230729349300103_public.xml,"{'NamePerson': 'LINDA J BROWN', 'BaseCompensationFilingOrg': '142994', 'CompBasedOnRelatedOrgs': '0', 'BonusFilingOrg': '0', 'BonusRelatedOrgs': '0', 'OtherCompensationFilingOrg': '605', 'OtherCompensationRelatedOrgs': '0', 'DeferredCompFilingOrg': '10378', 'DeferredCompRelatedOrgs': '0', 'NontaxableBenefitsFilingOrg': '8693', 'NontaxableBenefitsRelatedOrgs': '0', 'TotalCompensationFilingOrg': '162670', 'TotalCompensationRelatedOrgs': '0', 'CompReportPrior990FilingOrg': '0', 'CompReportPrior990RelatedOrgs': '0'}"
1828819,https://s3.amazonaws.com/irs-form-990/202231249349301218_public.xml,"{'PersonNm': 'JENNA ROGERS KING', 'TitleTxt': 'DIR. OF ADMISSION & ENROLLMENT', 'BaseCompensationFilingOrgAmt': '244912', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '16838', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '46290', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '308040', 'TotalCompensationRltdOrgsAmt': '0'}"
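
The sampled rows show that the field holding a person's name differs across filings: `NamePerson` in the older records, `PersonNm` in the newer ones, and a nested `BusinessName` dict in the last row of `persondf` shown earlier. A hedged sketch of a lookup that tries each in turn follows. `get_person_name` is a hypothetical helper, not part of this notebook, and it covers only the variants visible in this section.

```python
def get_person_name(record):
    """Return the person name from one Part II record, or None if absent."""
    # Flat keys seen in this notebook's output.
    for key in ("PersonNm", "NamePerson"):
        if key in record:
            return record[key]
    # Nested form seen in this notebook's output.
    business = record.get("BusinessName")
    if isinstance(business, dict):
        return business.get("BusinessNameLine1Txt")
    return None


print(get_person_name({"NamePerson": "LINDA J BROWN"}))
print(get_person_name({"PersonNm": "JENNA ROGERS KING"}))
print(get_person_name({"BusinessName": {"BusinessNameLine1Txt": "Larry L Parker"}}))
```

Other schema versions may use keys not listed here, so a `None` result is worth counting before relying on the extracted names.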


In [51]:
%%time
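import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
# Final save of the full person-level DF; this overwrites the earlier save of res under the same file name
persondf.to_pickle('Schedule J Part II (PERSON-LEVEL DF).pkl.gz', compression='gzip')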

Current date and time :  2025-06-27 12:30:52 

CPU times: total: 1min 18s
Wall time: 1min 21s
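
To confirm that a gzip-compressed pickle reloads intact, the frame can be read back with `pd.read_pickle` and compared with the original. The sketch below uses a toy frame and a temporary directory so that nothing on disk is overwritten; the file name `toy.pkl.gz` and the records are invented.

```python
import os
import tempfile

import pandas as pd

# Invented toy frame with the same two columns as persondf.
toy = pd.DataFrame({
    "URL": ["a.xml", "b.xml"],
    "Form990ScheduleJPartII": [{"PersonNm": "A"}, {"PersonNm": "B"}],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl.gz")
    toy.to_pickle(path, compression="gzip")
    reloaded = pd.read_pickle(path, compression="gzip")

# Compare row by row as plain Python records.
print(reloaded.to_dict("records") == toy.to_dict("records"))
```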
