# Overview

Based on *IRS 990 e-File Data -- Excise Tax Project (4) --Parse Schedule J Part (II) and Generate Person-Level DF.ipynb*

In this notebook I read in this file (N=2,972,064):
- *Excise Tax Project - Schedule J Part II (PERSON-LEVEL DF).pkl.gz*

I then read in a *concordance* file and a collapsed version (*new_variables_df*) for Part II:
- *concordance - Schedule J Part II (VERIFIED).pkl*
- *concordance - collapsed - Schedule J Part II (new_variables_df).pkl*

The following notebook has detailed steps on how I created the additional *concordance* file variables and *new_variables_df*:
- *IRS 990 e-File Data -- Excise Tax Project (3b) -- Parse Schedule J Part II (person-level DF) -- NOT NEEDED -- ONLY FOR CODED FOR BUILDING CONCORDANCE FILE*
- Note that in the above notebook I updated the *concordance* and *new_variables_df* files that I read in here.

I then parse the *Form990ScheduleJPartII* column. 
- *Note*: I determined it was quicker -- and less prone to crashing -- to *parse* the *Form990PartVIISectionAGrp* column rather than 'flatten' it. I do this through a series of modified 'dictionary variable' parsing functions, and I do the same thing here with *Form990ScheduleJPartII*.
- *Note*: This differs from prior versions of this executive compensation notebook. I have created a better *xpath_top_len* variable (the same as in the 'Contractor Compensation' notebooks). Every variable now has a value of 2 for *len_subkeys* in *new_variables_df*, so I loop over *new_variables_df* rows based on *xpath_top_len* instead.
- Process variables with *xpath_top_len* value of 2
- Process variables with *xpath_top_len* value of 3
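To illustrate the difference between the two approaches, here is a minimal sketch on two made-up records. The person names and amounts are invented; the key names come from the concordance. It assumes that 'flattening' means something like `pd.json_normalize`, which may not be exactly what the earlier notebooks did, and `either_key` is a hypothetical helper written only for this sketch.

```python
import numpy as np
import pandas as pd

# Two toy records: one uses the older element names, one the newer ones.
records = pd.Series([
    {'NamePerson': 'PERSON A', 'BonusRelatedOrgs': '790'},
    {'PersonNm': 'PERSON B', 'BonusRelatedOrganizationsAmt': '0'},
])

# Flattening: every distinct key becomes its own column, so the same
# field ends up split across an old-name column and a new-name column.
flat = pd.json_normalize(records.tolist())
print(flat.columns.tolist())

# Targeted parsing: look up one pair of keys and write a single column.
def either_key(d, key1, key2):  # hypothetical helper for this sketch
    if not isinstance(d, dict):
        return np.nan
    return d.get(key1, d.get(key2, np.nan))

name = records.apply(either_key, key1='NamePerson', key2='PersonNm')
print(name.tolist())
```

With targeted parsing the harmonized variable is produced in one pass, and no wide intermediate frame with one column per raw key is ever built.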

I convert relevant columns to numeric:

    df[int_vars] = df[int_vars].apply(pd.to_numeric, errors='coerce')

Save DF -- with *Form990ScheduleJPartII* column. 

- *Schedule J Part II (PERSON-LEVEL DF) parsed -- with Form990ScheduleJPartII.pkl.gz*

Save DF -- without *Form990ScheduleJPartII* column. 

- *Schedule J Part II (PERSON-LEVEL DF) parsed.pkl.gz*

# Load Packages and Set Working Directory

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

2.2.2


In [3]:
from platform import python_version
print(python_version())

3.10.11


In [4]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 2500)

In [5]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

#### Set working directory

In [6]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read PANDAS DF

In [7]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_pickle('Schedule J Part II (PERSON-LEVEL DF).pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

Current date and time :  2025-06-27 12:33:58 

# of columns: 2
# of observations: 2972064
CPU times: total: 19.5 s
Wall time: 20.1 s


Unnamed: 0,URL,Form990ScheduleJPartII
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,"{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}"


# Read in concordance file

In [8]:
concordance = pd.read_pickle('concordance - Schedule J Part II (VERIFIED).pkl')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:1]

# of columns: 17
# of observations: 38


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,MongoDB_Name,sub_key,sub_sub_key,xpath_top_full,xpath_top,xpath_top_len,xpath_second
0,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartII/NamePerson,SJ_02_PC_NAME_OFF_TRST_KEYEMP,,,,,Name of officer - person,SCHED-J-PART-02-COL-A-(i),PART-02,PersonNameType,NamePerson,NamePerson,,Form990ScheduleJPartII/NamePerson,Form990ScheduleJPartII,2,


In [9]:
print(len(concordance['variable_name_new'].tolist()))
print(len(set(concordance['variable_name_new'].tolist())))

38
18


<br>Read in DF collapsed by new variable name. There are 18 variables that we ultimately need to work with.

In [10]:
new_variables_df = pd.read_pickle('concordance - collapsed - Schedule J Part II (new_variables_df).pkl')
print('# of columns:', len(new_variables_df.columns))
print('# of observations:', len(new_variables_df))
new_variables_df[:1]

# of columns: 9
# of observations: 18


Unnamed: 0,variable_name_new,sub_keys,xpaths,xpaths_second,xpaths_third,xpath_top_len,data_type_xsd,len_subkeys,xpath_second_len
0,SJ_02_PC_COMP_BASE,"[BaseCompensationFilingOrgAmt, BaseCompensationFilingOrg]","[RltdOrgOfficerTrstKeyEmplGrp/BaseCompensationFilingOrgAmt, Form990ScheduleJPartII/BaseCompensationFilingOrg]",[nan],,2,USAmountType,2,1
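The relationship between the 38-row *concordance* and the 18-row *new_variables_df* can be sketched with a `groupby`. This is only an illustration on a four-row toy table and is not the code used to build the real file (that lives in the 3b notebook). `toy_concordance` and `collapsed` are names made up for this sketch.

```python
import pandas as pd

# Toy concordance: two xpaths (one per element name) for each new variable.
toy_concordance = pd.DataFrame({
    'variable_name_new': ['SJ_02_PC_NAME_OFF_TRST_KEYEMP',
                          'SJ_02_PC_NAME_OFF_TRST_KEYEMP',
                          'SJ_02_PC_COMP_BONUS_RELATED',
                          'SJ_02_PC_COMP_BONUS_RELATED'],
    'sub_key': ['NamePerson', 'PersonNm',
                'BonusRelatedOrgs', 'BonusRelatedOrganizationsAmt'],
})

# Collapse to one row per new variable, gathering its sub-keys into a list.
collapsed = (toy_concordance
             .groupby('variable_name_new')['sub_key']
             .agg(list)
             .rename('sub_keys')
             .reset_index())
collapsed['len_subkeys'] = collapsed['sub_keys'].apply(len)
print(collapsed)
```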


### Write Functions

In [11]:
new_variables_df['len_subkeys'].value_counts()

len_subkeys
2    18
Name: count, dtype: int64

In [12]:
for index, row in df[:2].iterrows():
    print(type(row['Form990ScheduleJPartII']))

<class 'dict'>
<class 'dict'>


In [13]:
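def func(x, key1, key2):
    # Return the value stored under key1 or, failing that, under key2
    # (the same field under its alternate element name); NaN if neither exists.
    if pd.isnull(x):
        return np.nan
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan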
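A quick check of how this lookup behaves on edge cases, using invented records. The function is repeated here only so the sketch runs on its own. Note that *key1* takes precedence when a record happens to contain both keys.

```python
import numpy as np
import pandas as pd

def func(x, key1, key2):  # copy of the function above
    if pd.isnull(x):
        return np.nan
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan

toy = pd.Series([
    {'NamePerson': 'PERSON A'},                    # older key only
    {'PersonNm': 'PERSON B'},                      # newer key only
    {'TitleTxt': 'TREASURER'},                     # neither key present
    np.nan,                                        # missing record
    {'NamePerson': 'FIRST', 'PersonNm': 'SECOND'}, # both keys (made up)
], dtype=object)

out = toy.apply(func, key1='NamePerson', key2='PersonNm')
print(out.tolist())
```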

#### Process variables with *xpath_top_len* value of 2

In [14]:
print(len(new_variables_df[new_variables_df['xpath_top_len']==2]))
new_variables_df[new_variables_df['xpath_top_len']==2]

16


Unnamed: 0,variable_name_new,sub_keys,xpaths,xpaths_second,xpaths_third,xpath_top_len,data_type_xsd,len_subkeys,xpath_second_len
0,SJ_02_PC_COMP_BASE,"[BaseCompensationFilingOrgAmt, BaseCompensationFilingOrg]","[RltdOrgOfficerTrstKeyEmplGrp/BaseCompensationFilingOrgAmt, Form990ScheduleJPartII/BaseCompensationFilingOrg]",[nan],,2,USAmountType,2,1
15,SJ_02_PC_NONTAXED_BENF,"[NontaxableBenefitsFilingOrgAmt, NontaxableBenefitsFilingOrg]","[Form990ScheduleJPartII/NontaxableBenefitsFilingOrg, RltdOrgOfficerTrstKeyEmplGrp/NontaxableBenefitsFilingOrgAmt]",[nan],,2,USAmountType,2,1
12,SJ_02_PC_NAME_OFF_TRST_KEYEMP,"[NamePerson, PersonNm]","[RltdOrgOfficerTrstKeyEmplGrp/PersonNm, Form990ScheduleJPartII/NamePerson]",[nan],,2,PersonNameType,2,1
11,SJ_02_PC_COMP_TOTAL_RELATED,"[TotalCompensationRltdOrgsAmt, TotalCompensationRelatedOrgs]","[RltdOrgOfficerTrstKeyEmplGrp/TotalCompensationRltdOrgsAmt, Form990ScheduleJPartII/TotalCompensationRelatedOrgs]",[nan],,2,USAmountType,2,1
10,SJ_02_PC_COMP_TOTAL,"[TotalCompensationFilingOrg, TotalCompensationFilingOrgAmt]","[RltdOrgOfficerTrstKeyEmplGrp/TotalCompensationFilingOrgAmt, Form990ScheduleJPartII/TotalCompensationFilingOrg]",[nan],,2,USAmountType,2,1
9,SJ_02_PC_COMP_OTHER_RELATED,"[OtherCompensationRelatedOrgs, OtherCompensationRltdOrgsAmt]","[Form990ScheduleJPartII/OtherCompensationRelatedOrgs, RltdOrgOfficerTrstKeyEmplGrp/OtherCompensationRltdOrgsAmt]",[nan],,2,USAmountType,2,1
16,SJ_02_PC_NONTAXED_BENF_RELATED,"[NontaxableBenefitsRelatedOrgs, NontaxableBenefitsRltdOrgsAmt]","[RltdOrgOfficerTrstKeyEmplGrp/NontaxableBenefitsRltdOrgsAmt, Form990ScheduleJPartII/NontaxableBenefitsRelatedOrgs]",[nan],,2,USAmountType,2,1
8,SJ_02_PC_COMP_OTHER,"[OtherCompensationFilingOrgAmt, OtherCompensationFilingOrg]","[Form990ScheduleJPartII/OtherCompensationFilingOrg, RltdOrgOfficerTrstKeyEmplGrp/OtherCompensationFilingOrgAmt]",[nan],,2,USAmountType,2,1
6,SJ_02_PC_COMP_DEF_PRIOR,"[CompReportPrior990FilingOrg, CompReportPrior990FilingOrgAmt]","[Form990ScheduleJPartII/CompReportPrior990FilingOrg, RltdOrgOfficerTrstKeyEmplGrp/CompReportPrior990FilingOrgAmt]",[nan],,2,USAmountType,2,1
5,SJ_02_PC_COMP_DEFERRED_RELATED,"[DeferredCompRltdOrgsAmt, DeferredCompRelatedOrgs]","[RltdOrgOfficerTrstKeyEmplGrp/DeferredCompRltdOrgsAmt, Form990ScheduleJPartII/DeferredCompRelatedOrgs]",[nan],,2,USAmountType,2,1


<br>To see the full *xpath* for the above variables, we can look at the *concordance* file.

In [15]:
concordance[concordance['xpath_top_len']==2][:2]

Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,MongoDB_Name,sub_key,sub_sub_key,xpath_top_full,xpath_top,xpath_top_len,xpath_second
0,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartII/NamePerson,SJ_02_PC_NAME_OFF_TRST_KEYEMP,,,,,Name of officer - person,SCHED-J-PART-02-COL-A-(i),PART-02,PersonNameType,NamePerson,NamePerson,,Form990ScheduleJPartII/NamePerson,Form990ScheduleJPartII,2,
1,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/PersonNm,SJ_02_PC_NAME_OFF_TRST_KEYEMP,,,,,Name of officer - person,SCHED-J-PART-02-COL-A-(i),PART-02,PersonNameType,PersonNm,PersonNm,,RltdOrgOfficerTrstKeyEmplGrp/PersonNm,RltdOrgOfficerTrstKeyEmplGrp,2,


In [16]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for index, row in new_variables_df[new_variables_df['xpath_top_len']==2][:].iterrows():
    variable = row['variable_name_new']
    keys = row['sub_keys']
    key1 = keys[0]
    key2 = keys[1]
    print(variable, key1, key2)
    df[variable] = df['Form990ScheduleJPartII'][:].apply(func, key1=key1, key2=key2)

Current date and time :  2025-06-27 12:34:38 

SJ_02_PC_COMP_BASE BaseCompensationFilingOrgAmt BaseCompensationFilingOrg
SJ_02_PC_NONTAXED_BENF NontaxableBenefitsFilingOrgAmt NontaxableBenefitsFilingOrg
SJ_02_PC_NAME_OFF_TRST_KEYEMP NamePerson PersonNm
SJ_02_PC_COMP_TOTAL_RELATED TotalCompensationRltdOrgsAmt TotalCompensationRelatedOrgs
SJ_02_PC_COMP_TOTAL TotalCompensationFilingOrg TotalCompensationFilingOrgAmt
SJ_02_PC_COMP_OTHER_RELATED OtherCompensationRelatedOrgs OtherCompensationRltdOrgsAmt
SJ_02_PC_NONTAXED_BENF_RELATED NontaxableBenefitsRelatedOrgs NontaxableBenefitsRltdOrgsAmt
SJ_02_PC_COMP_OTHER OtherCompensationFilingOrgAmt OtherCompensationFilingOrg
SJ_02_PC_COMP_DEF_PRIOR CompReportPrior990FilingOrg CompReportPrior990FilingOrgAmt
SJ_02_PC_COMP_DEFERRED_RELATED DeferredCompRltdOrgsAmt DeferredCompRelatedOrgs
SJ_02_PC_COMP_DEFERRED DeferredCompFilingOrg DeferredCompensationFlngOrgAmt
SJ_02_PC_COMP_BONUS_RELATED BonusRelatedOrganizationsAmt BonusRelatedOrgs
SJ_02_PC_COMP_BONU

In [17]:
df.sample(1)

Unnamed: 0,URL,Form990ScheduleJPartII,SJ_02_PC_COMP_BASE,SJ_02_PC_NONTAXED_BENF,SJ_02_PC_NAME_OFF_TRST_KEYEMP,SJ_02_PC_COMP_TOTAL_RELATED,SJ_02_PC_COMP_TOTAL,SJ_02_PC_COMP_OTHER_RELATED,SJ_02_PC_NONTAXED_BENF_RELATED,SJ_02_PC_COMP_OTHER,SJ_02_PC_COMP_DEF_PRIOR,SJ_02_PC_COMP_DEFERRED_RELATED,SJ_02_PC_COMP_DEFERRED,SJ_02_PC_COMP_BONUS_RELATED,SJ_02_PC_COMP_BONUS,SJ_02_PC_COMP_BASE_RELATED,SJ_02_PC_COMP_DEF_PRIOR_RELATED,SJ_02_PC_TITLE
1753995,https://s3.amazonaws.com/irs-form-990/202202229349300830_public.xml,"{'PersonNm': 'GREGORY P WOLF', 'TitleTxt': 'GENERAL MANAGER', 'BaseCompensationFilingOrgAmt': '355115', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '63000', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '13774', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '12825', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '2007', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '446721', 'TotalCompensationRltdOrgsAmt': '0', 'CompReportPrior990FilingOrgAmt': '0', 'CompReportPrior990RltdOrgsAmt': '0'}",355115,2007,GREGORY P WOLF,0,446721,0,0,13774,0,0,12825,0,63000,0,0,GENERAL MANAGER


#### Process variables with *xpath_top_len* value of 3

In [18]:
new_variables_df[new_variables_df['xpath_top_len']==3]

Unnamed: 0,variable_name_new,sub_keys,xpaths,xpaths_second,xpaths_third,xpath_top_len,data_type_xsd,len_subkeys,xpath_second_len
13,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1,"[BusinessNameLine1, BusinessNameLine1Txt]","[RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine1Txt, RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine1, Form990ScheduleJPartII/NameBusiness/BusinessNameLine1]","[BusinessName, NameBusiness]",,3,BusinessNameLine1Type,2,2
14,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2,"[BusinessNameLine2Txt, BusinessNameLine2]","[RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine2Txt, RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine2, Form990ScheduleJPartII/NameBusiness/BusinessNameLine2]","[BusinessName, NameBusiness]",,3,BusinessNameLine2Type,2,2


In [19]:
def func_three_levels(x, level1, level2, key1, key2):
    if pd.isnull(x):
        return np.nan
    # Check each candidate parent element in turn; move on to the next
    # parent if this one is present but holds neither key.
    for level in (level1, level2):
        if level in x.keys():
            if key1 in x[level].keys():
                return x[level][key1]
            elif key2 in x[level].keys():
                return x[level][key2]
    return np.nan
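The nested lookup can be sketched on invented records as follows. `nested_lookup` is a hypothetical, slightly more defensive variant written for this illustration (it also tolerates a parent element that is not a dictionary); it is not the function used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

def nested_lookup(record, levels, keys):
    """Return the first value found under any (level, key) pair, else NaN."""
    if not isinstance(record, dict):
        return np.nan
    for level in levels:
        child = record.get(level)
        if isinstance(child, dict):
            for key in keys:
                if key in child:
                    return child[key]
    return np.nan

toy = pd.Series([
    {'BusinessName': {'BusinessNameLine1Txt': 'EXAMPLE ORG A'}},
    {'NameBusiness': {'BusinessNameLine1': 'EXAMPLE ORG B'}},
    {'PersonNm': 'PERSON C'},   # a person, so no business-name element
    np.nan,
], dtype=object)

line1 = toy.apply(nested_lookup,
                  levels=('BusinessName', 'NameBusiness'),
                  keys=('BusinessNameLine1', 'BusinessNameLine1Txt'))
print(line1.tolist())
```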

In [20]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for index, row in new_variables_df[new_variables_df['xpath_top_len']==3][:].iterrows():
    variable = row['variable_name_new'] 
    levels = row['xpaths_second']
    level1 = levels[0]
    level2 = levels[1]    
    keys = row['sub_keys']
    key1 = keys[0]
    key2 = keys[1]
    print(variable, level1, level2, key1, key2)    
    df[variable] = df['Form990ScheduleJPartII'][:].apply(func_three_levels, level1=level1, level2=level2,
                                                            key1=key1, key2=key2)

Current date and time :  2025-06-27 12:53:53 

SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1 BusinessName NameBusiness BusinessNameLine1 BusinessNameLine1Txt
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2 BusinessName NameBusiness BusinessNameLine2Txt BusinessNameLine2
CPU times: total: 25.3 s
Wall time: 26.1 s


In [21]:
pd.set_option('display.float_format', lambda x: '%.1f' % x)

In [22]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for c in df.columns.tolist():
    print(c, len(df[df[c].notnull()]))

Current date and time :  2025-06-27 12:57:20 

URL 2972064
Form990ScheduleJPartII 2972064
SJ_02_PC_COMP_BASE 2807086
SJ_02_PC_NONTAXED_BENF 2699243
SJ_02_PC_NAME_OFF_TRST_KEYEMP 2809765
SJ_02_PC_COMP_TOTAL_RELATED 2608804
SJ_02_PC_COMP_TOTAL 2812158
SJ_02_PC_COMP_OTHER_RELATED 2523543
SJ_02_PC_NONTAXED_BENF_RELATED 2575018
SJ_02_PC_COMP_OTHER 2560643
SJ_02_PC_COMP_DEF_PRIOR 2434654
SJ_02_PC_COMP_DEFERRED_RELATED 2566118
SJ_02_PC_COMP_DEFERRED 2679030
SJ_02_PC_COMP_BONUS_RELATED 2508267
SJ_02_PC_COMP_BONUS 2556043
SJ_02_PC_COMP_BASE_RELATED 2601343
SJ_02_PC_COMP_DEF_PRIOR_RELATED 2426826
SJ_02_PC_TITLE 2578078
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1 161718
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2 249
CPU times: total: 39.1 s
Wall time: 41 s
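As an aside, the same non-null counts can be obtained without a Python loop via `DataFrame.count()`. The sketch below checks the equivalence on a small invented frame; no timing on the full data is claimed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'URL': ['u1', 'u2', 'u3'],
    'SJ_02_PC_COMP_BASE': [355115.0, np.nan, 0.0],
    'SJ_02_PC_TITLE': ['GENERAL MANAGER', None, None],
})

# Loop as in the cell above: one boolean filter per column.
loop_counts = {c: len(toy[toy[c].notnull()]) for c in toy.columns}

# Vectorized equivalent: count() returns the non-null count per column.
vector_counts = toy.count()
print(vector_counts)
```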


In [23]:
set(new_variables_df['variable_name_new'].tolist()) - set(df.columns.tolist())

set()

In [24]:
set(df.columns.tolist()) - set(new_variables_df['variable_name_new'].tolist())

{'Form990ScheduleJPartII', 'URL'}

### Change Data Type

In [25]:
df.dtypes

URL                                 object
Form990ScheduleJPartII              object
SJ_02_PC_COMP_BASE                  object
SJ_02_PC_NONTAXED_BENF              object
SJ_02_PC_NAME_OFF_TRST_KEYEMP       object
SJ_02_PC_COMP_TOTAL_RELATED         object
SJ_02_PC_COMP_TOTAL                 object
SJ_02_PC_COMP_OTHER_RELATED         object
SJ_02_PC_NONTAXED_BENF_RELATED      object
SJ_02_PC_COMP_OTHER                 object
SJ_02_PC_COMP_DEF_PRIOR             object
SJ_02_PC_COMP_DEFERRED_RELATED      object
SJ_02_PC_COMP_DEFERRED              object
SJ_02_PC_COMP_BONUS_RELATED         object
SJ_02_PC_COMP_BONUS                 object
SJ_02_PC_COMP_BASE_RELATED          object
SJ_02_PC_COMP_DEF_PRIOR_RELATED     object
SJ_02_PC_TITLE                      object
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1    object
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2    object
dtype: object

In [26]:
new_variables_df['data_type_xsd'].value_counts()

data_type_xsd
USAmountType             14
PersonNameType            1
LineExplanationType       1
BusinessNameLine1Type     1
BusinessNameLine2Type     1
Name: count, dtype: int64

In [27]:
int_vars = new_variables_df[new_variables_df['data_type_xsd']=='USAmountType']['variable_name_new'].tolist()
print(len(int_vars))
int_vars

14


['SJ_02_PC_COMP_BASE',
 'SJ_02_PC_NONTAXED_BENF',
 'SJ_02_PC_COMP_TOTAL_RELATED',
 'SJ_02_PC_COMP_TOTAL',
 'SJ_02_PC_COMP_OTHER_RELATED',
 'SJ_02_PC_NONTAXED_BENF_RELATED',
 'SJ_02_PC_COMP_OTHER',
 'SJ_02_PC_COMP_DEF_PRIOR',
 'SJ_02_PC_COMP_DEFERRED_RELATED',
 'SJ_02_PC_COMP_DEFERRED',
 'SJ_02_PC_COMP_BONUS_RELATED',
 'SJ_02_PC_COMP_BONUS',
 'SJ_02_PC_COMP_BASE_RELATED',
 'SJ_02_PC_COMP_DEF_PRIOR_RELATED']

In [28]:
df[int_vars].dtypes

SJ_02_PC_COMP_BASE                 object
SJ_02_PC_NONTAXED_BENF             object
SJ_02_PC_COMP_TOTAL_RELATED        object
SJ_02_PC_COMP_TOTAL                object
SJ_02_PC_COMP_OTHER_RELATED        object
SJ_02_PC_NONTAXED_BENF_RELATED     object
SJ_02_PC_COMP_OTHER                object
SJ_02_PC_COMP_DEF_PRIOR            object
SJ_02_PC_COMP_DEFERRED_RELATED     object
SJ_02_PC_COMP_DEFERRED             object
SJ_02_PC_COMP_BONUS_RELATED        object
SJ_02_PC_COMP_BONUS                object
SJ_02_PC_COMP_BASE_RELATED         object
SJ_02_PC_COMP_DEF_PRIOR_RELATED    object
dtype: object

In [29]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df[int_vars] = df[int_vars].apply(pd.to_numeric, errors='coerce')

Current date and time :  2025-06-27 12:58:32 

CPU times: total: 23.5 s
Wall time: 24.1 s


In [30]:
df[int_vars].dtypes

SJ_02_PC_COMP_BASE                 float64
SJ_02_PC_NONTAXED_BENF             float64
SJ_02_PC_COMP_TOTAL_RELATED        float64
SJ_02_PC_COMP_TOTAL                float64
SJ_02_PC_COMP_OTHER_RELATED        float64
SJ_02_PC_NONTAXED_BENF_RELATED     float64
SJ_02_PC_COMP_OTHER                float64
SJ_02_PC_COMP_DEF_PRIOR            float64
SJ_02_PC_COMP_DEFERRED_RELATED     float64
SJ_02_PC_COMP_DEFERRED             float64
SJ_02_PC_COMP_BONUS_RELATED        float64
SJ_02_PC_COMP_BONUS                float64
SJ_02_PC_COMP_BASE_RELATED         float64
SJ_02_PC_COMP_DEF_PRIOR_RELATED    float64
dtype: object
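The columns come out as *float64* rather than an integer type because missing values are stored as NaN, which NumPy's integer dtypes cannot hold. A minimal sketch on invented values follows. The `'N/A'` string is a made-up example of an unparseable entry, and the `Int64` step is an optional alternative, not something this notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'SJ_02_PC_COMP_BASE': ['355115', None, '0'],
    'SJ_02_PC_COMP_BONUS': ['63000', 'N/A', None],  # 'N/A' is a made-up bad value
})
amount_cols = toy.columns.tolist()

# errors='coerce' turns anything unparseable into NaN instead of raising.
toy[amount_cols] = toy[amount_cols].apply(pd.to_numeric, errors='coerce')
print(toy.dtypes)

# Optional: pandas' nullable integer dtype keeps missing values as <NA>.
as_int = toy[amount_cols].astype('Int64')
print(as_int.dtypes)
```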

In [31]:
df.dtypes

URL                                  object
Form990ScheduleJPartII               object
SJ_02_PC_COMP_BASE                  float64
SJ_02_PC_NONTAXED_BENF              float64
SJ_02_PC_NAME_OFF_TRST_KEYEMP        object
SJ_02_PC_COMP_TOTAL_RELATED         float64
SJ_02_PC_COMP_TOTAL                 float64
SJ_02_PC_COMP_OTHER_RELATED         float64
SJ_02_PC_NONTAXED_BENF_RELATED      float64
SJ_02_PC_COMP_OTHER                 float64
SJ_02_PC_COMP_DEF_PRIOR             float64
SJ_02_PC_COMP_DEFERRED_RELATED      float64
SJ_02_PC_COMP_DEFERRED              float64
SJ_02_PC_COMP_BONUS_RELATED         float64
SJ_02_PC_COMP_BONUS                 float64
SJ_02_PC_COMP_BASE_RELATED          float64
SJ_02_PC_COMP_DEF_PRIOR_RELATED     float64
SJ_02_PC_TITLE                       object
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1     object
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2     object
dtype: object

#### Save DF -- with *Form990ScheduleJPartII*

In [32]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('Schedule J Part II (PERSON-LEVEL DF) parsed -- with Form990ScheduleJPartII.pkl.gz', compression='gzip')

Current date and time :  2025-06-27 13:33:54 

CPU times: total: 3min 53s
Wall time: 3min 59s


#### Save DF -- without *Form990ScheduleJPartII*

In [33]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for c in df.columns.tolist():
    print(c, len(df[df[c].notnull()]))

Current date and time :  2025-06-27 13:57:40 

URL 2972064
Form990ScheduleJPartII 2972064
SJ_02_PC_COMP_BASE 2807086
SJ_02_PC_NONTAXED_BENF 2699243
SJ_02_PC_NAME_OFF_TRST_KEYEMP 2809765
SJ_02_PC_COMP_TOTAL_RELATED 2608804
SJ_02_PC_COMP_TOTAL 2812158
SJ_02_PC_COMP_OTHER_RELATED 2523543
SJ_02_PC_NONTAXED_BENF_RELATED 2575018
SJ_02_PC_COMP_OTHER 2560643
SJ_02_PC_COMP_DEF_PRIOR 2434654
SJ_02_PC_COMP_DEFERRED_RELATED 2566118
SJ_02_PC_COMP_DEFERRED 2679030
SJ_02_PC_COMP_BONUS_RELATED 2508267
SJ_02_PC_COMP_BONUS 2556043
SJ_02_PC_COMP_BASE_RELATED 2601343
SJ_02_PC_COMP_DEF_PRIOR_RELATED 2426826
SJ_02_PC_TITLE 2578078
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1 161718
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2 249
CPU times: total: 18.9 s
Wall time: 19.7 s


In [34]:
df.dtypes

URL                                  object
Form990ScheduleJPartII               object
SJ_02_PC_COMP_BASE                  float64
SJ_02_PC_NONTAXED_BENF              float64
SJ_02_PC_NAME_OFF_TRST_KEYEMP        object
SJ_02_PC_COMP_TOTAL_RELATED         float64
SJ_02_PC_COMP_TOTAL                 float64
SJ_02_PC_COMP_OTHER_RELATED         float64
SJ_02_PC_NONTAXED_BENF_RELATED      float64
SJ_02_PC_COMP_OTHER                 float64
SJ_02_PC_COMP_DEF_PRIOR             float64
SJ_02_PC_COMP_DEFERRED_RELATED      float64
SJ_02_PC_COMP_DEFERRED              float64
SJ_02_PC_COMP_BONUS_RELATED         float64
SJ_02_PC_COMP_BONUS                 float64
SJ_02_PC_COMP_BASE_RELATED          float64
SJ_02_PC_COMP_DEF_PRIOR_RELATED     float64
SJ_02_PC_TITLE                       object
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1     object
SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2     object
dtype: object

In [35]:
print(df.columns.tolist())

['URL', 'Form990ScheduleJPartII', 'SJ_02_PC_COMP_BASE', 'SJ_02_PC_NONTAXED_BENF', 'SJ_02_PC_NAME_OFF_TRST_KEYEMP', 'SJ_02_PC_COMP_TOTAL_RELATED', 'SJ_02_PC_COMP_TOTAL', 'SJ_02_PC_COMP_OTHER_RELATED', 'SJ_02_PC_NONTAXED_BENF_RELATED', 'SJ_02_PC_COMP_OTHER', 'SJ_02_PC_COMP_DEF_PRIOR', 'SJ_02_PC_COMP_DEFERRED_RELATED', 'SJ_02_PC_COMP_DEFERRED', 'SJ_02_PC_COMP_BONUS_RELATED', 'SJ_02_PC_COMP_BONUS', 'SJ_02_PC_COMP_BASE_RELATED', 'SJ_02_PC_COMP_DEF_PRIOR_RELATED', 'SJ_02_PC_TITLE', 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1', 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2']


In [36]:
concordance['variable_name_new'].tolist()

['SJ_02_PC_NAME_OFF_TRST_KEYEMP',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2',
 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2',
 'SJ_02_PC_TITLE',
 'SJ_02_PC_TITLE',
 'SJ_02_PC_COMP_BASE',
 'SJ_02_PC_COMP_BASE',
 'SJ_02_PC_COMP_BASE_RELATED',
 'SJ_02_PC_COMP_BASE_RELATED',
 'SJ_02_PC_COMP_BONUS',
 'SJ_02_PC_COMP_BONUS',
 'SJ_02_PC_COMP_BONUS_RELATED',
 'SJ_02_PC_COMP_BONUS_RELATED',
 'SJ_02_PC_COMP_OTHER',
 'SJ_02_PC_COMP_OTHER',
 'SJ_02_PC_COMP_OTHER_RELATED',
 'SJ_02_PC_COMP_OTHER_RELATED',
 'SJ_02_PC_COMP_DEFERRED',
 'SJ_02_PC_COMP_DEFERRED',
 'SJ_02_PC_COMP_DEFERRED_RELATED',
 'SJ_02_PC_COMP_DEFERRED_RELATED',
 'SJ_02_PC_NONTAXED_BENF',
 'SJ_02_PC_NONTAXED_BENF',
 'SJ_02_PC_NONTAXED_BENF_RELATED',
 'SJ_02_PC_NONTAXED_BENF_RELATED',
 'SJ_02_PC_COMP_TOTAL',
 'SJ_02_PC_COMP_TOTAL',
 'SJ_02_PC_COMP_TOTAL_RELATED',
 'SJ_02_PC_C

<br>Re-arrange column order

In [37]:
cols = ['URL', #'Form990ScheduleJPartII', 
        'SJ_02_PC_NAME_OFF_TRST_KEYEMP', 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1', 'SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2',
        'SJ_02_PC_TITLE',
        'SJ_02_PC_COMP_BASE', 'SJ_02_PC_COMP_BASE_RELATED', 'SJ_02_PC_COMP_BONUS', 'SJ_02_PC_COMP_BONUS_RELATED',
        'SJ_02_PC_COMP_OTHER', 'SJ_02_PC_COMP_OTHER_RELATED',
        'SJ_02_PC_COMP_DEFERRED', 'SJ_02_PC_COMP_DEFERRED_RELATED', 
        'SJ_02_PC_NONTAXED_BENF', 'SJ_02_PC_NONTAXED_BENF_RELATED', 
        'SJ_02_PC_COMP_TOTAL', 'SJ_02_PC_COMP_TOTAL_RELATED',
        'SJ_02_PC_COMP_DEF_PRIOR', 'SJ_02_PC_COMP_DEF_PRIOR_RELATED', 
        ]
df[cols].sample(5)

Unnamed: 0,URL,SJ_02_PC_NAME_OFF_TRST_KEYEMP,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2,SJ_02_PC_TITLE,SJ_02_PC_COMP_BASE,SJ_02_PC_COMP_BASE_RELATED,SJ_02_PC_COMP_BONUS,SJ_02_PC_COMP_BONUS_RELATED,SJ_02_PC_COMP_OTHER,SJ_02_PC_COMP_OTHER_RELATED,SJ_02_PC_COMP_DEFERRED,SJ_02_PC_COMP_DEFERRED_RELATED,SJ_02_PC_NONTAXED_BENF,SJ_02_PC_NONTAXED_BENF_RELATED,SJ_02_PC_COMP_TOTAL,SJ_02_PC_COMP_TOTAL_RELATED,SJ_02_PC_COMP_DEF_PRIOR,SJ_02_PC_COMP_DEF_PRIOR_RELATED
32332,https://s3.amazonaws.com/irs-form-990/201121369349302597_public.xml,STEVEN SCHWARTZ,,,,446763.0,,,,480.0,,,,40217.0,,487460.0,,,
1309535,https://s3.amazonaws.com/irs-form-990/201932279349303063_public.xml,Eric D Berger,,,VP Supply Chain,,181687.0,,,,8416.0,,,,33820.0,,223923.0,,
1170751,https://s3.amazonaws.com/irs-form-990/201832819349301103_public.xml,Josephine Bias Robinson,,,"EVP, external affairs",0.0,256768.0,0.0,0.0,0.0,15288.0,0.0,16500.0,0.0,0.0,0.0,288556.0,0.0,0.0
2152692,https://s3.amazonaws.com/irs-form-990/202301319349307515_public.xml,BRIAN T CORBETT ESQ,,,SECRETARY & TRUSTEE,0.0,508710.0,0.0,166404.0,0.0,107.0,0.0,107277.0,0.0,36719.0,0.0,819217.0,0.0,101049.0
1117359,https://s3.amazonaws.com/irs-form-990/201831249349301018_public.xml,,DOUGLAS WICKERHAM,,TREASURER,0.0,313517.0,0.0,24530.0,0.0,22044.0,0.0,2977.0,0.0,17040.0,0.0,380108.0,0.0,0.0


##### Check that columns match expectations

In [38]:
set(cols) - set(concordance['variable_name_new'].tolist())

{'URL'}

In [39]:
set(concordance['variable_name_new'].tolist()) - set(cols)

set()

In [40]:
set(cols) - set(df.columns.tolist())

set()

In [41]:
set(df.columns.tolist()) - set(cols)

{'Form990ScheduleJPartII'}
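The four set differences above could be wrapped in one small helper so the check fails loudly when something is off. This is only a sketch; `compare_columns` is a hypothetical name and the frame is invented.

```python
import pandas as pd

def compare_columns(actual, expected, ignore=()):
    """Return (missing, unexpected) column names, skipping those in `ignore`."""
    actual = set(actual) - set(ignore)
    expected = set(expected) - set(ignore)
    return sorted(expected - actual), sorted(actual - expected)

toy = pd.DataFrame(columns=['URL', 'Form990ScheduleJPartII',
                            'SJ_02_PC_TITLE', 'SJ_02_PC_COMP_BASE'])
expected = ['SJ_02_PC_TITLE', 'SJ_02_PC_COMP_BASE']

missing, unexpected = compare_columns(
    toy.columns, expected, ignore=['URL', 'Form990ScheduleJPartII'])
assert not missing and not unexpected, (missing, unexpected)
print(missing, unexpected)
```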

In [42]:
df[cols].sample(3)

Unnamed: 0,URL,SJ_02_PC_NAME_OFF_TRST_KEYEMP,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2,SJ_02_PC_TITLE,SJ_02_PC_COMP_BASE,SJ_02_PC_COMP_BASE_RELATED,SJ_02_PC_COMP_BONUS,SJ_02_PC_COMP_BONUS_RELATED,SJ_02_PC_COMP_OTHER,SJ_02_PC_COMP_OTHER_RELATED,SJ_02_PC_COMP_DEFERRED,SJ_02_PC_COMP_DEFERRED_RELATED,SJ_02_PC_NONTAXED_BENF,SJ_02_PC_NONTAXED_BENF_RELATED,SJ_02_PC_COMP_TOTAL,SJ_02_PC_COMP_TOTAL_RELATED,SJ_02_PC_COMP_DEF_PRIOR,SJ_02_PC_COMP_DEF_PRIOR_RELATED
497189,https://s3.amazonaws.com/irs-form-990/201413219349306276_public.xml,Dr Stephen Baum,,,"Trustee, Physician",253224.0,0.0,0.0,0.0,19181.0,0.0,30978.0,0.0,6082.0,0.0,309465.0,0.0,0.0,0.0
2429302,https://s3.amazonaws.com/irs-form-990/202442819349302369_public.xml,STEVE JENKINS,,,DEPUTY EXECUTIVE DIRECTOR,407.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,49.0,0.0,466.0,0.0,0.0,0.0
518672,https://s3.amazonaws.com/irs-form-990/201343199349304254_public.xml,MARIA GHORMLEY,,,EMERGENCY NURSE,165307.0,0.0,0.0,0.0,0.0,0.0,4605.0,0.0,11058.0,0.0,180970.0,0.0,0.0,0.0


<br>Save file(s) without *Form990ScheduleJPartII*

In [43]:
len(df)

2972064

In [44]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df[cols].to_pickle('Schedule J Part II (PERSON-LEVEL DF) parsed.pkl.gz', compression='gzip')

Current date and time :  2025-06-27 14:04:32 

CPU times: total: 2min 32s
Wall time: 2min 37s
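A saved file can be sanity-checked by reading it back and comparing it with what was written. The sketch below does this for a small invented frame in a temporary directory; the file name and the `keep` list are placeholders, not the real output.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    'URL': ['u1', 'u2'],
    'Form990ScheduleJPartII': [{'PersonNm': 'PERSON A'}, {'PersonNm': 'PERSON B'}],
    'SJ_02_PC_NAME_OFF_TRST_KEYEMP': ['PERSON A', 'PERSON B'],
})
keep = ['URL', 'SJ_02_PC_NAME_OFF_TRST_KEYEMP']  # drop the raw dictionary column

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy parsed.pkl.gz')
    toy[keep].to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

# Raises an AssertionError if values, dtypes, or column order differ.
pd.testing.assert_frame_equal(restored, toy[keep])
print(restored.columns.tolist())
```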
