# Overview

Based on *IRS 990 e-File Data -- Excise Tax Project (2) -- Schedule J Part (I) -- Verify New Variables and Combine and Binarize Columns.ipynb*

From the prior notebook I read in this file:
- *Schedule J (Part I).pkl.gz*

In this notebook I read in Part I of Schedule J, along with the unverified concordance file. After initial verifications I go back to Excel and modify the concordance file, read it back in, and perform additional verifications. I then combine and binarize the 26 Schedule J (Part I) columns and save the following file:

- *Schedule J (Part I) - parsed.pkl.gz*

In [2]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [3]:
print(pd.__version__)

2.2.2


In [4]:
from platform import python_version
print(python_version())

3.10.11


In [5]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

In [6]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [7]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

#### Set working directory

In [8]:
#cd '/Users/gsaxton/Dropbox/990 e-file data'

In [9]:
pwd

'C:\\Users\\Gregory\\Jupyter_Notebooks'

In [10]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read in Concordance File
We are going to read in two codebooks. First, there is the 'concordance' file. Specifically, before re-arranging and renaming variables, we will read in the relevant section from the *master concordance* file, and then use this file to identify the relevant 'compensation' variables. In a following notebook, we will be using the *new_variable_name* field as our variable name.

In [11]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
concordance = pd.read_excel('concordance - Schedule J (Part I).xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

Current date and time :  2025-06-20 22:28:28 

# of columns: 15
# of observations: 52
CPU times: total: 172 ms
Wall time: 11.2 s


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFees,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,,binarize,ClubDuesOrFees,,
1,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFeesInd,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,,binarize,ClubDuesOrFeesInd,,


In [12]:
concordance[concordance['data_type_xsd'].isnull()][['variable_name_new', 'description']]

Unnamed: 0,variable_name_new,description


In [13]:
concordance['data_type_xsd'].value_counts()

data_type_xsd
CheckboxType    28
BooleanType     24
Name: count, dtype: int64

In [14]:
concordance[concordance['data_type_xsd']=='BooleanType'][:1]

Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
16,/Return/ReturnData/IRS990ScheduleJ/WrittenPolicyRefTAndEExpnssInd,SJ_01_PC_WRITTEN_POLICY,,,,,Written policy reference T and E expenses?,SCHED-J-PART-01-LINE-1b,PART-01,BooleanType,,binarize,WrittenPolicyRefTAndEExpnssInd,,


In [15]:
concordance[concordance['data_type_xsd']=='CheckboxType'][:1]

Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFees,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,,binarize,ClubDuesOrFees,,


# Read 990 DB into PANDAS 
- In previous round there were 1,547,828 observations; in Feb. 2020 there were 1,727,056 observations; in Nov. 2020 there are 1,895,016 observations
- I also switched to using the *.pkl* file

In [16]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_pickle('Schedule J (Part I).pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

Current date and time :  2025-06-20 22:29:27 

# of columns: 58
# of observations: 687646
CPU times: total: 4.73 s
Wall time: 5.94 s


Unnamed: 0,URL,@documentId,@softwareId,@softwareVersion,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,IndependentConsultant,WrittenEmploymentContract,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,PersonalServices,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,PersonalServicesInd,PaymentsForUseOfResidenceInd,@documentName,@softwareVersionNum
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,IRS990ScheduleJ,10000105,2010v3.2,X,X,X,False,True,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,https://s3.amazonaws.com/irs-form-990/201113139349301316_public.xml,IRS990ScheduleJ,10000105,2010v3.2,X,X,X,False,True,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Identify which *MongoDB_Names* and xpaths are not in data

In [17]:
print(df.columns.tolist()[:5])

['URL', '@documentId', '@softwareId', '@softwareVersion', 'CompensationCommittee']


In [18]:
print(concordance['MongoDB_Name'].tolist())

['ClubDuesOrFees', 'ClubDuesOrFeesInd', 'DiscretionarySpendingAccount', 'DiscretionarySpendingAcctInd', 'FirstClassOrCharterTravel', 'FirstClassOrCharterTravelInd', 'HousingAllowanceOrResidence', 'HousingAllowanceOrResidenceInd', 'IdemnificationGrossUpPayments', 'IdemnificationGrossUpPmtsInd', 'PaymentsForUseOfResidence', 'PaymentsForUseOfResidenceInd', 'PersonalServices', 'PersonalServicesInd', 'TravelForCompanions', 'TravelForCompanionsInd', 'WrittenPolicyRefTAndEExpnssInd', 'WrittenPolicyReTAndEExpenses', 'SubstantiationRequired', 'SubstantiationRequiredInd', 'BoardOrCommitteeApproval', 'BoardOrCommitteeApprovalInd', 'CompensationCommittee', 'CompensationCommitteeInd', 'CompensationSurvey', 'CompensationSurveyInd', 'Form990OfOtherOrganizations', 'Form990OfOtherOrganizationsInd', 'IndependentConsultant', 'IndependentConsultantInd', 'WrittenEmploymentContract', 'WrittenEmploymentContractInd', 'SeverancePayment', 'SeverancePaymentInd', 'SupplementalNonqualRetirePlan', 'SupplementalNonq

In [19]:
set(df.columns.tolist()) - set(concordance['MongoDB_Name'].tolist())

{'@documentId',
 '@documentName',
 '@softwareId',
 '@softwareVersion',
 '@softwareVersionNum',
 'URL'}

In [20]:
df[['@documentId',
 '@documentName',
 '@softwareId',
 '@softwareVersion',
 '@softwareVersionNum', 
 'URL']].describe().T

Unnamed: 0,count,unique,top,freq
@documentId,687646,529,RetDoc1042400001,416532
@documentName,1225,1,IRS990ScheduleJ,1225
@softwareId,103808,100,21013475,4861
@softwareVersion,25771,31,v1.00,5696
@softwareVersionNum,76149,41,v1.00,12993
URL,687646,687646,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,1


In [21]:
df[['@documentId',
 '@documentName',
 '@softwareId',
 '@softwareVersion',
 '@softwareVersionNum']].sample(15)

Unnamed: 0,@documentId,@documentName,@softwareId,@softwareVersion,@softwareVersionNum
105744,IRS990ScheduleJ,,12000229.0,2012v2.0,
71531,R000007,,11000129.0,v1.00,
145314,RetDoc1042400001,,,,
285295,RetDoc1042400001,,,,
376968,00000005,,,,
283947,RetDoc1042400001,,,,
97282,RetDoc1042400001,,,,
247884,RetDoc1042400001,,,,
395361,RetDoc1042400001,,,,
648921,RetDoc1042400001,,,,


#### Drop the five columns

In [22]:
# Drop the five metadata columns that have no counterpart in the concordance
df = df.drop(columns=['@documentId', '@documentName', '@softwareId',
                      '@softwareVersion', '@softwareVersionNum'])
df[:1]

Unnamed: 0,URL,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,IndependentConsultant,WrittenEmploymentContract,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,PersonalServices,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,PersonalServicesInd,PaymentsForUseOfResidenceInd
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,X,X,X,False,True,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [23]:
set(df.columns.tolist()) - set(concordance['MongoDB_Name'].tolist())

{'URL'}

In [24]:
set(concordance['MongoDB_Name'].tolist()) - set(df.columns.tolist())

set()
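
The two set differences above can be wrapped into one helper that reports both directions at once. This is only a sketch on toy data: `compare_names`, `toy_df`, and `toy_concordance` are hypothetical names, and the `ignore` default assumes that *URL* is the only identifier column we expect to be missing from the concordance.

```python
import pandas as pd

def compare_names(data_columns, concordance_names, ignore=('URL',)):
    """Return names found only in the data and names found only in the concordance."""
    data_set = set(data_columns) - set(ignore)
    conc_set = set(concordance_names)
    return {
        'only_in_data': sorted(data_set - conc_set),
        'only_in_concordance': sorted(conc_set - data_set),
    }

# Toy stand-ins for df and concordance (hypothetical values)
toy_df = pd.DataFrame(columns=['URL', 'ClubDuesOrFees', 'ClubDuesOrFeesInd', '@documentId'])
toy_concordance = pd.DataFrame(
    {'MongoDB_Name': ['ClubDuesOrFees', 'ClubDuesOrFeesInd', 'PersonalServices']}
)

result = compare_names(toy_df.columns, toy_concordance['MongoDB_Name'])
print(result)
```

Both lists being empty corresponds to the `{'URL'}` / `set()` results shown above.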

##### Check whether 'Form990ScheduleJPartI' is in the DF
In the initial concordance file, there were three *xpaths* for each variable, e.g., for *SJ_01_PC_SEVERANCE* there were these three:

/Return/ReturnData/IRS990ScheduleJ/SeverancePayment	
/Return/ReturnData/IRS990ScheduleJ/SeverancePaymentInd	
/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartI/SeverancePayment	

None of the data is contained in a 'Form990ScheduleJPartI' heading so I will now delete those rows from the concordance file.

In [25]:
'Form990ScheduleJPartI' in df.columns.tolist()

False
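
The unneeded rows were deleted from the concordance file by hand in Excel. For reference, a pandas equivalent might look like the sketch below. It runs on a toy concordance built from the three example xpaths listed above; `toy_concordance` and `trimmed` are hypothetical names, and matching on the literal substring `/Form990ScheduleJPartI/` is an assumption about how those rows can be identified.

```python
import pandas as pd

# Toy concordance with the three xpaths quoted above for SJ_01_PC_SEVERANCE
toy_concordance = pd.DataFrame({
    'xpath': [
        '/Return/ReturnData/IRS990ScheduleJ/SeverancePayment',
        '/Return/ReturnData/IRS990ScheduleJ/SeverancePaymentInd',
        '/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartI/SeverancePayment',
    ],
    'variable_name_new': ['SJ_01_PC_SEVERANCE'] * 3,
})

# Flag rows whose xpath passes through the unused 'Form990ScheduleJPartI' node
is_part_i = toy_concordance['xpath'].str.contains('/Form990ScheduleJPartI/', regex=False)

# Keep everything else
trimmed = toy_concordance[~is_part_i].reset_index(drop=True)
print(len(toy_concordance), '->', len(trimmed))
```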

#### Re-read in modified concordance file (after deleting unneeded rows) and then re-verify
- This was done in the original notebook so I didn't need to do it again. 

In [26]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
concordance = pd.read_excel('concordance - Schedule J (Part I).xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

Current date and time :  2025-06-20 22:30:04 

# of columns: 15
# of observations: 52
CPU times: total: 31.2 ms
Wall time: 43.4 ms


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFees,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,,binarize,ClubDuesOrFees,,
1,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFeesInd,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,,binarize,ClubDuesOrFeesInd,,


In [27]:
set(df.columns.tolist()) - set(concordance['MongoDB_Name'].tolist())

{'URL'}

In [28]:
set(concordance['MongoDB_Name'].tolist()) - set(df.columns.tolist())

set()

### Collapse
"Pandas is warning that, in future versions, the sub-dataframe x you get inside agg_funcs will not include the column(s) you grouped on (variable_name_new)."

In [29]:
def agg_funcs(x):
    names = {
        #'name': x['variable_name_new'].head(1).values[0],
        'original_names':  list(set(x['MongoDB_Name'].tolist())),
        'data_type_xsd': x['data_type_xsd'].head(1).values[0],
        'BINARIZE': x['BINARIZE'].head(1).values[0]
        }
    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS
    #return pd.Series(names, index = list(names.keys()))
    return pd.Series(names, index=['original_names', 'data_type_xsd', 'BINARIZE'])
new_variables_df = concordance[:].groupby(['variable_name_new']).apply(agg_funcs)
new_variables_df = new_variables_df.reset_index()
print('# of variables:', len(new_variables_df))
new_variables_df[:5]

# of variables: 26


  new_variables_df = concordance[:].groupby(['variable_name_new']).apply(agg_funcs)


Unnamed: 0,variable_name_new,original_names,data_type_xsd,BINARIZE
0,SJ_01_PC_BOARD_APPROVAL,"[BoardOrCommitteeApproval, BoardOrCommitteeApprovalInd]",CheckboxType,binarize
1,SJ_01_PC_CLUB_FEES,"[ClubDuesOrFeesInd, ClubDuesOrFees]",CheckboxType,binarize
2,SJ_01_PC_COMPANION_TRAVEL,"[TravelForCompanionsInd, TravelForCompanions]",CheckboxType,binarize
3,SJ_01_PC_COMPENSATION_COMMITTEE,"[CompensationCommittee, CompensationCommitteeInd]",CheckboxType,binarize
4,SJ_01_PC_COMPENSATION_SURVEY,"[CompensationSurvey, CompensationSurveyInd]",CheckboxType,binarize


##### Option 1

In [31]:
def agg_funcs(x):
    return pd.Series({
        'original_names': list(dict.fromkeys(x['MongoDB_Name'])),  # preserves first-seen order
        'data_type_xsd' : x['data_type_xsd'].iat[0],
        'BINARIZE'      : x['BINARIZE'].iat[0]
    })

new_variables_df = (
    concordance
      .groupby('variable_name_new')          # group on the new name
      .apply(agg_funcs, include_groups=False)  # <-- no warning
      .reset_index()
)

print('# of variables:', len(new_variables_df))
new_variables_df[:5]

# of variables: 26


Unnamed: 0,variable_name_new,original_names,data_type_xsd,BINARIZE
0,SJ_01_PC_BOARD_APPROVAL,"[BoardOrCommitteeApproval, BoardOrCommitteeApprovalInd]",CheckboxType,binarize
1,SJ_01_PC_CLUB_FEES,"[ClubDuesOrFees, ClubDuesOrFeesInd]",CheckboxType,binarize
2,SJ_01_PC_COMPANION_TRAVEL,"[TravelForCompanions, TravelForCompanionsInd]",CheckboxType,binarize
3,SJ_01_PC_COMPENSATION_COMMITTEE,"[CompensationCommittee, CompensationCommitteeInd]",CheckboxType,binarize
4,SJ_01_PC_COMPENSATION_SURVEY,"[CompensationSurvey, CompensationSurveyInd]",CheckboxType,binarize


##### Option 2

In [32]:
new_variables_df = (
    concordance
      .groupby('variable_name_new', as_index=False)
      .agg(
          original_names = ('MongoDB_Name', lambda s: list(dict.fromkeys(s))),
          data_type_xsd  = ('data_type_xsd', 'first'),
          BINARIZE       = ('BINARIZE', 'first')
      )
)

print('# of variables:', len(new_variables_df))
new_variables_df[:5]

# of variables: 26


Unnamed: 0,variable_name_new,original_names,data_type_xsd,BINARIZE
0,SJ_01_PC_BOARD_APPROVAL,"[BoardOrCommitteeApproval, BoardOrCommitteeApprovalInd]",CheckboxType,binarize
1,SJ_01_PC_CLUB_FEES,"[ClubDuesOrFees, ClubDuesOrFeesInd]",CheckboxType,binarize
2,SJ_01_PC_COMPANION_TRAVEL,"[TravelForCompanions, TravelForCompanionsInd]",CheckboxType,binarize
3,SJ_01_PC_COMPENSATION_COMMITTEE,"[CompensationCommittee, CompensationCommitteeInd]",CheckboxType,binarize
4,SJ_01_PC_COMPENSATION_SURVEY,"[CompensationSurvey, CompensationSurveyInd]",CheckboxType,binarize


In [33]:
new_variables_df['len'] = new_variables_df['original_names'].apply(lambda x: len(x))
print(new_variables_df['len'].value_counts(), '\n')
new_variables_df[:4]

len
2    26
Name: count, dtype: int64 



Unnamed: 0,variable_name_new,original_names,data_type_xsd,BINARIZE,len
0,SJ_01_PC_BOARD_APPROVAL,"[BoardOrCommitteeApproval, BoardOrCommitteeApprovalInd]",CheckboxType,binarize,2
1,SJ_01_PC_CLUB_FEES,"[ClubDuesOrFees, ClubDuesOrFeesInd]",CheckboxType,binarize,2
2,SJ_01_PC_COMPANION_TRAVEL,"[TravelForCompanions, TravelForCompanionsInd]",CheckboxType,binarize,2
3,SJ_01_PC_COMPENSATION_COMMITTEE,"[CompensationCommittee, CompensationCommitteeInd]",CheckboxType,binarize,2


# Combine all columns where *len*==2

### Define Function to combine columns
To avoid repeating the same logic for every pair of columns, we can wrap it in a helper function. First we'll create a function called 'combine' that merges two variables into one: it keeps the value from the first variable where one exists and otherwise falls back to the second. It takes four *inputs*: our dataset/dataframe (*df*), the name we'd like for our new variable (*newvar*), the name of the first variable to combine (*var1*), and the name of the second variable to combine (*var2*).

In [34]:
def combine(df, newvar, var1, var2):
    df[newvar] = np.where(df[var1].notnull(), df[var1], df[var2])
    #print(df[newvar].value_counts().head(), '\n')
    #print('# of missing observations:', len(df[df[newvar].isnull()]))
    #print('# of valid observations:', len(df[df[newvar].notnull()]), '\n')  
    #return df.sample(5)[[newvar, var1, var2, 'DLN']] 
    #print(df[[newvar, var1, var2, 'ObjectId']][:5], '\n\n\n')
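
As an aside, the same "first variable, else second" rule can be written with `Series.fillna` instead of `np.where`. The sketch below uses a toy dataframe; `combine_fillna` and `toy` are hypothetical names and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def combine_fillna(df, newvar, var1, var2):
    # Take var1 where present, otherwise fall back to var2
    df[newvar] = df[var1].fillna(df[var2])

# Toy data: each row has a value in at most one of the two source columns
toy = pd.DataFrame({
    'ClubDuesOrFees': ['X', np.nan, np.nan],
    'ClubDuesOrFeesInd': [np.nan, 'X', np.nan],
})
combine_fillna(toy, 'SJ_01_PC_CLUB_FEES', 'ClubDuesOrFees', 'ClubDuesOrFeesInd')
print(toy)
```

Rows that are null in both source columns stay null in the combined column, just as with the `np.where` version.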

#### Do initial check to ensure that no row has values in both columns

In [35]:
%%time
for index, row in new_variables_df[new_variables_df['len']==2][:].iterrows():
    #print(row['variable_name_new'])
    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    print('\t\t', len(df[df[row['original_names'][0]].notnull()]))
    print('\t\t', len(df[df[row['original_names'][1]].notnull()]))
    #print(len(df[(df[row['original_names'][0]].isnull()) & (df[row['original_names'][1]].isnull())]), '\n\n')      
    print('OK IF ZERO:', len(df[(df[row['original_names'][0]].notnull()) & (df[row['original_names'][1]].notnull())]), '\n\n')

SJ_01_PC_BOARD_APPROVAL BoardOrCommitteeApproval BoardOrCommitteeApprovalInd
		 82699
		 315609
OK IF ZERO: 0 


SJ_01_PC_CLUB_FEES ClubDuesOrFees ClubDuesOrFeesInd
		 10596
		 29276
OK IF ZERO: 0 


SJ_01_PC_COMPANION_TRAVEL TravelForCompanions TravelForCompanionsInd
		 7291
		 21248
OK IF ZERO: 0 


SJ_01_PC_COMPENSATION_COMMITTEE CompensationCommittee CompensationCommitteeInd
		 46589
		 151930
OK IF ZERO: 0 


SJ_01_PC_COMPENSATION_SURVEY CompensationSurvey CompensationSurveyInd
		 58300
		 196366
OK IF ZERO: 0 


SJ_01_PC_CONSULTANT IndependentConsultant IndependentConsultantInd
		 25949
		 72355
OK IF ZERO: 0 


SJ_01_PC_CONTINGENT_NET_OWN CompBasedNetEarningsFilingOrg CompBsdNetEarnsFlngOrgInd
		 104079
		 445893
OK IF ZERO: 0 


SJ_01_PC_CONTINGENT_NET_RELATED CompBasedNetEarningsRelateOrgs CompBsdNetEarnsRltdOrgsInd
		 104058
		 445845
OK IF ZERO: 0 


SJ_01_PC_CONTINGENT_REV_OWN CompBasedOnRevenueOfFilingOrg CompBasedOnRevenueOfFlngOrgInd
		 104087
		 445923
OK IF ZERO: 0 
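
The "OK IF ZERO" figure counts rows where both source columns are non-null. A compact way to compute that count, without building a filtered dataframe, is sketched below on toy data. `count_overlap` and `toy` are hypothetical names, and the last toy row overlaps on purpose so that the check has something to catch.

```python
import numpy as np
import pandas as pd

def count_overlap(df, var1, var2):
    """Number of rows where both source columns are non-null."""
    return int((df[var1].notna() & df[var2].notna()).sum())

toy = pd.DataFrame({
    'SeverancePayment': ['false', np.nan, np.nan, 'true'],
    'SeverancePaymentInd': [np.nan, '0', np.nan, '1'],  # last row deliberately overlaps
})
overlap = count_overlap(toy, 'SeverancePayment', 'SeverancePaymentInd')
print('OK IF ZERO:', overlap)
```

A non-zero count would mean that `combine` silently prefers the first column over the second for those rows.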




### Combine

In [36]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
combo_fails = []
for index, row in new_variables_df[new_variables_df['len']==2][:].iterrows():
    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    try:
        combine(df, row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    except Exception:
        print('\n\n\n\n\n***********issue with variable: ', row['variable_name_new'])
        combo_fails.append(row['variable_name_new'])

print(combo_fails)

Current date and time :  2025-06-20 22:34:23 

SJ_01_PC_BOARD_APPROVAL BoardOrCommitteeApproval BoardOrCommitteeApprovalInd
SJ_01_PC_CLUB_FEES ClubDuesOrFees ClubDuesOrFeesInd
SJ_01_PC_COMPANION_TRAVEL TravelForCompanions TravelForCompanionsInd
SJ_01_PC_COMPENSATION_COMMITTEE CompensationCommittee CompensationCommitteeInd
SJ_01_PC_COMPENSATION_SURVEY CompensationSurvey CompensationSurveyInd
SJ_01_PC_CONSULTANT IndependentConsultant IndependentConsultantInd
SJ_01_PC_CONTINGENT_NET_OWN CompBasedNetEarningsFilingOrg CompBsdNetEarnsFlngOrgInd
SJ_01_PC_CONTINGENT_NET_RELATED CompBasedNetEarningsRelateOrgs CompBsdNetEarnsRltdOrgsInd
SJ_01_PC_CONTINGENT_REV_OWN CompBasedOnRevenueOfFilingOrg CompBasedOnRevenueOfFlngOrgInd
SJ_01_PC_CONTINGENT_REV_RELATED CompBasedOnRevenueRelatedOrgs CompBsdOnRevRelatedOrgsInd
SJ_01_PC_CONTRACT WrittenEmploymentContract WrittenEmploymentContractInd
SJ_01_PC_CONTRACT_EXCEPTION InitialContractException InitialContractExceptionInd
SJ_01_PC_DISCRETIONARY_ACCOUNT Di

### Binarize

In [112]:
#binarize_cols = [c for c in new_variables_df[new_variables_df['data_type_xsd'].isin(['BooleanType', 'CheckboxType'])]['variable_name_new'].tolist()] 
#print(len(binarize_cols))
#print(binarize_cols)

In [37]:
binarize_cols = new_variables_df.loc[new_variables_df['BINARIZE']=='binarize', 'variable_name_new'].tolist()
print(len(binarize_cols))
print(binarize_cols)

26
['SJ_01_PC_BOARD_APPROVAL', 'SJ_01_PC_CLUB_FEES', 'SJ_01_PC_COMPANION_TRAVEL', 'SJ_01_PC_COMPENSATION_COMMITTEE', 'SJ_01_PC_COMPENSATION_SURVEY', 'SJ_01_PC_CONSULTANT', 'SJ_01_PC_CONTINGENT_NET_OWN', 'SJ_01_PC_CONTINGENT_NET_RELATED', 'SJ_01_PC_CONTINGENT_REV_OWN', 'SJ_01_PC_CONTINGENT_REV_RELATED', 'SJ_01_PC_CONTRACT', 'SJ_01_PC_CONTRACT_EXCEPTION', 'SJ_01_PC_DISCRETIONARY_ACCOUNT', 'SJ_01_PC_EQUITY_BASED_COMP', 'SJ_01_PC_FIRST_CLASS_TRAVEL', 'SJ_01_PC_HOME_OFFICE_SUBSIDY', 'SJ_01_PC_HOUSING_ALLOWANCE', 'SJ_01_PC_INDEMNIFICATION', 'SJ_01_PC_NON_FIXED_PAYMENTS', 'SJ_01_PC_OTHER_ORGS_990', 'SJ_01_PC_PERSONAL_SERVICES', 'SJ_01_PC_REBUTTABLE_PRESUMPTION', 'SJ_01_PC_SEVERANCE', 'SJ_01_PC_SUBSTANTIATION_REQUIRED', 'SJ_01_PC_SUPPLEMENTAL_RETIREMENT', 'SJ_01_PC_WRITTEN_POLICY']


In [38]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for c in binarize_cols[:]:
    print(df[df[c].notnull()][c].value_counts().head(), '\n')

Current date and time :  2025-06-20 22:34:31 

SJ_01_PC_BOARD_APPROVAL
X    398308
Name: count, dtype: int64 

SJ_01_PC_CLUB_FEES
X    39872
Name: count, dtype: int64 

SJ_01_PC_COMPANION_TRAVEL
X    28539
Name: count, dtype: int64 

SJ_01_PC_COMPENSATION_COMMITTEE
X    198519
Name: count, dtype: int64 

SJ_01_PC_COMPENSATION_SURVEY
X    254666
Name: count, dtype: int64 

SJ_01_PC_CONSULTANT
X    98304
Name: count, dtype: int64 

SJ_01_PC_CONTINGENT_NET_OWN
0        344654
false    194652
1          7013
true       3653
Name: count, dtype: int64 

SJ_01_PC_CONTINGENT_NET_RELATED
0        347092
false    195598
1          4571
true       2642
Name: count, dtype: int64 

SJ_01_PC_CONTINGENT_REV_OWN
0        345772
false    195334
1          5892
true       3012
Name: count, dtype: int64 

SJ_01_PC_CONTINGENT_REV_RELATED
0        349646
false    197199
1          2018
true        948
Name: count, dtype: int64 

SJ_01_PC_CONTRACT
X    154351
Name: count, dtype: int64 

SJ_01_PC_CONTRACT_EX

##### BASED ON THE PRECEDING CODE BLOCK, NONE OF THE ABOVE VARIABLES ARE LIKELY TO FAIL THE BINARIZATION PROCESS
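
That judgment comes from reading the value counts by eye. A programmatic version of the same check might look like the sketch below, which lists any non-null value outside the five strings the binarize step handles. `EXPECTED`, `unexpected_values`, and `toy` are hypothetical names, and `'maybe'` is a made-up bad value included only to show what a failure looks like.

```python
import numpy as np
import pandas as pd

EXPECTED = {'X', 'true', 'false', '1', '0'}  # strings the binarize step converts

def unexpected_values(series, expected=EXPECTED):
    """Return the set of non-null values that fall outside the expected strings."""
    observed = set(series.dropna().astype(str).unique())
    return observed - expected

toy = pd.DataFrame({
    'SJ_01_PC_CLUB_FEES': ['X', np.nan, 'X'],
    'SJ_01_PC_SEVERANCE': ['false', '0', 'maybe'],  # 'maybe' is a made-up bad value
})

# Keep only the columns that contain something unexpected
problems = {c: unexpected_values(toy[c]) for c in toy.columns}
problems = {c: v for c, v in problems.items() if v}
print(problems)
```

An empty dictionary would support the conclusion that no column is likely to fail binarization.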

In [163]:
#binarize_with_dict_cols = ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 
#                           'F9_04_PC_FR_EVENT_INC_GT_15K', 'F9_04_PC_GAMING_INC_GT_15K',
#                           'F9_04_PC_PROF_FR_EXP_GT_15K', 'F9_12_PC_ACCTG_METHOD_OTHER']

In [121]:
#for c in binarize_with_dict_cols[:]:
#    print(df[df[c].notnull()][c].value_counts()[:10], '\n')

In [122]:
#print(len(binarize_cols))
#binarize_cols = list(set(binarize_cols) - set(binarize_with_dict_cols))
#print(len(binarize_cols))

##### Check *F9_12_PC_ACCTG_METHOD_OTHER*, *F9_00_HD_EXEMPT_STATUS_501C*, and *F9_00_HD_INCLUDES_SUBORD_ORGS*
Based on the following frequencies, for *F9_12_PC_ACCTG_METHOD_OTHER* do an *np.where* and make it 'other'. Leave *F9_00_HD_EXEMPT_STATUS_501C* and *F9_00_HD_INCLUDES_SUBORD_ORGS* alone.

In [123]:
#print(df[df['F9_00_HD_EXEMPT_STATUS_501C'].notnull()]['F9_00_HD_EXEMPT_STATUS_501C'].value_counts().head())

#### Fix *F9_00_HD_EXEMPT_STATUS_501C*

In [None]:
"""
def func(x, key1, key2):
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan
"""        

In [113]:
#df['F9_00_HD_EXEMPT_STATUS_501C'] = df['F9_00_HD_EXEMPT_STATUS_501C'][:].apply(func, 
#                            key1='@organization501cTypeTxt', key2 ='@typeOf501cOrganization')

In [114]:
#df = df.drop('Organization501c_type', 1)

In [124]:
#df['F9_00_HD_EXEMPT_STATUS_501C'].value_counts()[:10]

#### Fix other five variables

In [125]:
#binarize_with_dict_cols

In [121]:
""" 
def func_text(x, key1):
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    
    elif type(x)==dict: 
        if key1 in x.keys():
            return x[key1]
    else:
        return x
"""         

In [141]:
#df['F9_00_HD_INCLUDES_SUBORD_ORGS'] = df['F9_00_HD_INCLUDES_SUBORD_ORGS'][:].apply(func_text, 
#                            key1='#text')

In [142]:
##df = df.drop('test', 1)

In [126]:
#df['F9_00_HD_INCLUDES_SUBORD_ORGS'].value_counts()

In [127]:
##binarize_with_dict_cols

In [148]:
#df['F9_04_PC_FR_EVENT_INC_GT_15K'] = df['F9_04_PC_FR_EVENT_INC_GT_15K'][:].apply(func_text, 
#                            key1='#text')

In [128]:
#df['F9_04_PC_FR_EVENT_INC_GT_15K'].value_counts()

In [129]:
#for c in binarize_with_dict_cols[:]:
#    print(df[df[c].notnull()][c].value_counts()[:10], '\n')

In [150]:
#df['F9_04_PC_GAMING_INC_GT_15K'] = df['F9_04_PC_GAMING_INC_GT_15K'][:].apply(func_text, 
#                            key1='#text')

In [130]:
#df['F9_04_PC_GAMING_INC_GT_15K'].value_counts()

In [131]:
#df['F9_04_PC_PROF_FR_EXP_GT_15K'] = df['F9_04_PC_PROF_FR_EXP_GT_15K'][:].apply(func_text, 
#                            key1='#text')

In [132]:
#df['F9_04_PC_FR_EVENT_INC_GT_15K'].value_counts()

#### Fix *F9_12_PC_ACCTG_METHOD_OTHER*

In [157]:
"""
def func_text2(x, key1, key2):
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    
    elif type(x)==dict: 
        if key1 in x.keys():
            return x[key1]
        elif key2 in x.keys():
            return x[key2]
    else:
        return x
"""        

In [160]:
#df['F9_12_PC_ACCTG_METHOD_OTHER'] = df['F9_12_PC_ACCTG_METHOD_OTHER'][:].apply(func_text2, 
##                            key1='@note', key2='@methodOfAccountingOtherDesc')

In [161]:
#df=df.drop('test', 1)

In [133]:
#df['F9_12_PC_ACCTG_METHOD_OTHER'].value_counts()[:10]

##### Remove two variables from *binarize_cols*

In [134]:
#print(len(binarize_cols))
#binarize_cols.remove('F9_12_PC_ACCTG_METHOD_OTHER') 
#binarize_cols.remove('F9_00_HD_EXEMPT_STATUS_501C')
#print(len(binarize_cols))

# Binarize Columns

In [137]:
#for col in binarize_cols:
#    print(df[col].value_counts(), '\n\n')

In [39]:
print(len(binarize_cols))

26


In [40]:
def binarize(df, variable):
    print(df[variable].value_counts(), '\n')
    df[variable] = np.where(df[variable]=='true', 1, df[variable])
    df[variable] = np.where(df[variable]=='false', 0, df[variable])
    df[variable] = np.where(df[variable]=='1', 1, df[variable])
    df[variable] = np.where(df[variable]=='0', 0, df[variable])
    df[variable] = np.where(df[variable]=='X', 1, df[variable])
    print(df[variable].value_counts(), '\n\n')
    return df.sample(10)[['URL', variable]]
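
For comparison, the five chained `np.where` calls can also be expressed as a single dictionary lookup. This is an alternative sketch, not the method used in this notebook: `MAPPING` and `binarize_map` are hypothetical names. Note one behavioral difference: with `Series.map`, any value missing from the dictionary becomes null, whereas the function above leaves such values unchanged. Casting to pandas' nullable `Int64` dtype keeps 0/1 as integers even when nulls are present.

```python
import numpy as np
import pandas as pd

# Same five conversions as the np.where chain
MAPPING = {'true': 1, 'false': 0, '1': 1, '0': 0, 'X': 1}

def binarize_map(series):
    # Values not in MAPPING (including nulls) become <NA>
    return series.map(MAPPING).astype('Int64')

toy = pd.Series(['X', 'true', 'false', '1', '0', np.nan])
out = binarize_map(toy)
print(out)
```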

In [41]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for col in binarize_cols:
    binarize(df, col)

Current date and time :  2025-06-20 22:35:02 

SJ_01_PC_BOARD_APPROVAL
X    398308
Name: count, dtype: int64 

SJ_01_PC_BOARD_APPROVAL
1    398308
Name: count, dtype: int64 


SJ_01_PC_CLUB_FEES
X    39872
Name: count, dtype: int64 

SJ_01_PC_CLUB_FEES
1    39872
Name: count, dtype: int64 


SJ_01_PC_COMPANION_TRAVEL
X    28539
Name: count, dtype: int64 

SJ_01_PC_COMPANION_TRAVEL
1    28539
Name: count, dtype: int64 


SJ_01_PC_COMPENSATION_COMMITTEE
X    198519
Name: count, dtype: int64 

SJ_01_PC_COMPENSATION_COMMITTEE
1    198519
Name: count, dtype: int64 


SJ_01_PC_COMPENSATION_SURVEY
X    254666
Name: count, dtype: int64 

SJ_01_PC_COMPENSATION_SURVEY
1    254666
Name: count, dtype: int64 


SJ_01_PC_CONSULTANT
X    98304
Name: count, dtype: int64 

SJ_01_PC_CONSULTANT
1    98304
Name: count, dtype: int64 


SJ_01_PC_CONTINGENT_NET_OWN
0        344654
false    194652
1          7013
true       3653
Name: count, dtype: int64 

SJ_01_PC_CONTINGENT_NET_OWN
0    539306
1     10666
N

In [43]:
df[binarize_cols].sample(10)

Unnamed: 0,SJ_01_PC_BOARD_APPROVAL,SJ_01_PC_CLUB_FEES,SJ_01_PC_COMPANION_TRAVEL,SJ_01_PC_COMPENSATION_COMMITTEE,SJ_01_PC_COMPENSATION_SURVEY,SJ_01_PC_CONSULTANT,SJ_01_PC_CONTINGENT_NET_OWN,SJ_01_PC_CONTINGENT_NET_RELATED,SJ_01_PC_CONTINGENT_REV_OWN,SJ_01_PC_CONTINGENT_REV_RELATED,SJ_01_PC_CONTRACT,SJ_01_PC_CONTRACT_EXCEPTION,SJ_01_PC_DISCRETIONARY_ACCOUNT,SJ_01_PC_EQUITY_BASED_COMP,SJ_01_PC_FIRST_CLASS_TRAVEL,SJ_01_PC_HOME_OFFICE_SUBSIDY,SJ_01_PC_HOUSING_ALLOWANCE,SJ_01_PC_INDEMNIFICATION,SJ_01_PC_NON_FIXED_PAYMENTS,SJ_01_PC_OTHER_ORGS_990,SJ_01_PC_PERSONAL_SERVICES,SJ_01_PC_REBUTTABLE_PRESUMPTION,SJ_01_PC_SEVERANCE,SJ_01_PC_SUBSTANTIATION_REQUIRED,SJ_01_PC_SUPPLEMENTAL_RETIREMENT,SJ_01_PC_WRITTEN_POLICY
680774,,,,,,,0.0,0.0,0.0,0.0,,0.0,,0,,,,,0.0,,,,0,,0,
585880,1.0,,,,,,,,,,1.0,,,0,,,,,,,,,0,,0,
581376,,,,,,,0.0,0.0,0.0,0.0,,0.0,,0,,,,,1.0,,,,1,,1,
266390,,,,,,,,,,,,,,0,,,,,,,,,0,,0,
215627,,,,,,,,,,,,,,0,,,,,,,,,0,,0,
480301,1.0,,,1.0,,,0.0,0.0,0.0,0.0,1.0,0.0,,0,,,,,0.0,1.0,,,0,1.0,0,
441084,1.0,,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,,0.0,,0,,,,,1.0,,,,0,,0,
409942,,,,,,,0.0,0.0,0.0,0.0,,0.0,,0,,,,,0.0,,,,1,,0,
648154,1.0,1.0,,1.0,,,,,,,,,,0,,,,,,,,,0,1.0,0,1.0
641008,1.0,,,,1.0,,0.0,0.0,0.0,0.0,,0.0,,0,,,,,0.0,,,,0,,0,


### Check that the number of non-null values in each new variable equals the sum for the two prior variables

In [44]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
for index, row in new_variables_df[new_variables_df['len']==2][:].iterrows():
    #print(row['variable_name_new'])
    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    print(len(df[df[row['original_names'][0]].notnull()]) + len(df[df[row['original_names'][1]].notnull()]))    
    print(len(df[df[row['variable_name_new']].notnull()]), '\n')
    #print(len(df[(df[row['original_names'][0]].notnull()) & (df[row['original_names'][1]].notnull())]), '\n')     

Current date and time :  2025-06-20 22:35:38 

SJ_01_PC_BOARD_APPROVAL BoardOrCommitteeApproval BoardOrCommitteeApprovalInd
398308
398308 

SJ_01_PC_CLUB_FEES ClubDuesOrFees ClubDuesOrFeesInd
39872
39872 

SJ_01_PC_COMPANION_TRAVEL TravelForCompanions TravelForCompanionsInd
28539
28539 

SJ_01_PC_COMPENSATION_COMMITTEE CompensationCommittee CompensationCommitteeInd
198519
198519 

SJ_01_PC_COMPENSATION_SURVEY CompensationSurvey CompensationSurveyInd
254666
254666 

SJ_01_PC_CONSULTANT IndependentConsultant IndependentConsultantInd
98304
98304 

SJ_01_PC_CONTINGENT_NET_OWN CompBasedNetEarningsFilingOrg CompBsdNetEarnsFlngOrgInd
549972
549972 

SJ_01_PC_CONTINGENT_NET_RELATED CompBasedNetEarningsRelateOrgs CompBsdNetEarnsRltdOrgsInd
549903
549903 

SJ_01_PC_CONTINGENT_REV_OWN CompBasedOnRevenueOfFilingOrg CompBasedOnRevenueOfFlngOrgInd
550010
550010 

SJ_01_PC_CONTINGENT_REV_RELATED CompBasedOnRevenueRelatedOrgs CompBsdOnRevRelatedOrgsInd
549811
549811 

SJ_01_PC_CONTRACT WrittenEmployme

<br><br>
From the above, we are fine with dropping the 52 original variables that feed into the 26 combined variables in *variable_name_new*.
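
The count comparison above can also be written as a check that returns numbers you can assert on. This is a minimal sketch on a toy frame: the helper `check_pair_counts` and the toy values are hypothetical, and only the column names come from this notebook. It also reports how many rows have both original columns filled in, since the counts only add up cleanly when that overlap is zero.

```python
import pandas as pd

def check_pair_counts(df, new_name, original_names):
    """Return (non-null total across the two originals,
    non-null count in the new column,
    rows where both originals are non-null)."""
    first, second = original_names
    n_originals = int(df[first].notna().sum() + df[second].notna().sum())
    n_new = int(df[new_name].notna().sum())
    n_overlap = int((df[first].notna() & df[second].notna()).sum())
    return n_originals, n_new, n_overlap

# Toy data (made up): each filing uses at most one of the two original columns
toy = pd.DataFrame({
    'ClubDuesOrFees': ['1', None, None, None],
    'ClubDuesOrFeesInd': [None, 'true', None, None],
    'SJ_01_PC_CLUB_FEES': [1, 1, None, None],
})

n_originals, n_new, n_overlap = check_pair_counts(
    toy, 'SJ_01_PC_CLUB_FEES', ['ClubDuesOrFees', 'ClubDuesOrFeesInd']
)
print(n_originals, n_new, n_overlap)
```

If `n_overlap` were ever greater than zero, the sum of the original counts would exceed the combined count, which would be worth investigating before dropping anything.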

### Inspect the Combined and Original Variables
Here I show one combined variable, *SJ_01_PC_SEVERANCE*, next to its two original columns.

In [46]:
df[df['SJ_01_PC_SEVERANCE'].notnull()].sample(5)[['SJ_01_PC_SEVERANCE', 'SeverancePaymentInd', 'SeverancePayment']]

Unnamed: 0,SJ_01_PC_SEVERANCE,SeverancePaymentInd,SeverancePayment
647862,0,0,
499612,0,0,
447863,0,0,
553675,0,0,
677132,0,false,
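
A random sample of five rows can miss rare value combinations. As a complement, the sketch below tabulates every distinct combination of the combined and original columns. The toy values are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Toy data (made up) mimicking the mix of string codes seen in the originals
toy = pd.DataFrame({
    'SJ_01_PC_SEVERANCE': [0, 0, 1, 0, 1],
    'SeverancePaymentInd': ['0', 'false', '1', '0', None],
    'SeverancePayment': [None, None, None, None, '1'],
})

cols = ['SJ_01_PC_SEVERANCE', 'SeverancePaymentInd', 'SeverancePayment']

# dropna=False keeps combinations in which one of the columns is missing
combos = (
    toy.groupby(cols, dropna=False)
       .size()
       .reset_index(name='n')
)
print(combos)
```

Each row of `combos` is one distinct mapping from original values to the combined value, so an unexpected mapping shows up as its own row.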


### Drop variables

In [47]:
new_variables_df[new_variables_df['len']!=2]['original_names'].tolist()

[]

In [48]:
new_variables_df[new_variables_df['len']==2]['original_names'].tolist()

[['BoardOrCommitteeApproval', 'BoardOrCommitteeApprovalInd'],
 ['ClubDuesOrFees', 'ClubDuesOrFeesInd'],
 ['TravelForCompanions', 'TravelForCompanionsInd'],
 ['CompensationCommittee', 'CompensationCommitteeInd'],
 ['CompensationSurvey', 'CompensationSurveyInd'],
 ['IndependentConsultant', 'IndependentConsultantInd'],
 ['CompBasedNetEarningsFilingOrg', 'CompBsdNetEarnsFlngOrgInd'],
 ['CompBasedNetEarningsRelateOrgs', 'CompBsdNetEarnsRltdOrgsInd'],
 ['CompBasedOnRevenueOfFilingOrg', 'CompBasedOnRevenueOfFlngOrgInd'],
 ['CompBasedOnRevenueRelatedOrgs', 'CompBsdOnRevRelatedOrgsInd'],
 ['WrittenEmploymentContract', 'WrittenEmploymentContractInd'],
 ['InitialContractException', 'InitialContractExceptionInd'],
 ['DiscretionarySpendingAccount', 'DiscretionarySpendingAcctInd'],
 ['EquityBasedCompArrangement', 'EquityBasedCompArrngmInd'],
 ['FirstClassOrCharterTravel', 'FirstClassOrCharterTravelInd'],
 ['PaymentsForUseOfResidence', 'PaymentsForUseOfResidenceInd'],
 ['HousingAllowanceOrResidence',

In [49]:
flat_list = [item for sublist in new_variables_df[new_variables_df['len']==2]['original_names'].tolist() for item in sublist]
print(len(flat_list))
print(flat_list[:])

52
['BoardOrCommitteeApproval', 'BoardOrCommitteeApprovalInd', 'ClubDuesOrFees', 'ClubDuesOrFeesInd', 'TravelForCompanions', 'TravelForCompanionsInd', 'CompensationCommittee', 'CompensationCommitteeInd', 'CompensationSurvey', 'CompensationSurveyInd', 'IndependentConsultant', 'IndependentConsultantInd', 'CompBasedNetEarningsFilingOrg', 'CompBsdNetEarnsFlngOrgInd', 'CompBasedNetEarningsRelateOrgs', 'CompBsdNetEarnsRltdOrgsInd', 'CompBasedOnRevenueOfFilingOrg', 'CompBasedOnRevenueOfFlngOrgInd', 'CompBasedOnRevenueRelatedOrgs', 'CompBsdOnRevRelatedOrgsInd', 'WrittenEmploymentContract', 'WrittenEmploymentContractInd', 'InitialContractException', 'InitialContractExceptionInd', 'DiscretionarySpendingAccount', 'DiscretionarySpendingAcctInd', 'EquityBasedCompArrangement', 'EquityBasedCompArrngmInd', 'FirstClassOrCharterTravel', 'FirstClassOrCharterTravelInd', 'PaymentsForUseOfResidence', 'PaymentsForUseOfResidenceInd', 'HousingAllowanceOrResidence', 'HousingAllowanceOrResidenceInd', 'Idemnifica

<br> Flatten a list of lists: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
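
An equivalent way to flatten the list, shown here only as an alternative sketch, is `itertools.chain.from_iterable` from the standard library. The two-pair list below is a shortened stand-in for the full list of 26 pairs.

```python
from itertools import chain

# Shortened stand-in for the list of [original, originalInd] pairs
pairs = [
    ['BoardOrCommitteeApproval', 'BoardOrCommitteeApprovalInd'],
    ['ClubDuesOrFees', 'ClubDuesOrFeesInd'],
]

# Nested comprehension, as used in the cell above
flat_comprehension = [item for sublist in pairs for item in sublist]

# Same result using the standard library
flat_chain = list(chain.from_iterable(pairs))
print(flat_chain)
```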

In [50]:
print(len([c for c in df.columns.tolist() if c not in flat_list]))
print([c for c in df.columns.tolist() if c not in flat_list])

27
['URL', 'SJ_01_PC_BOARD_APPROVAL', 'SJ_01_PC_CLUB_FEES', 'SJ_01_PC_COMPANION_TRAVEL', 'SJ_01_PC_COMPENSATION_COMMITTEE', 'SJ_01_PC_COMPENSATION_SURVEY', 'SJ_01_PC_CONSULTANT', 'SJ_01_PC_CONTINGENT_NET_OWN', 'SJ_01_PC_CONTINGENT_NET_RELATED', 'SJ_01_PC_CONTINGENT_REV_OWN', 'SJ_01_PC_CONTINGENT_REV_RELATED', 'SJ_01_PC_CONTRACT', 'SJ_01_PC_CONTRACT_EXCEPTION', 'SJ_01_PC_DISCRETIONARY_ACCOUNT', 'SJ_01_PC_EQUITY_BASED_COMP', 'SJ_01_PC_FIRST_CLASS_TRAVEL', 'SJ_01_PC_HOME_OFFICE_SUBSIDY', 'SJ_01_PC_HOUSING_ALLOWANCE', 'SJ_01_PC_INDEMNIFICATION', 'SJ_01_PC_NON_FIXED_PAYMENTS', 'SJ_01_PC_OTHER_ORGS_990', 'SJ_01_PC_PERSONAL_SERVICES', 'SJ_01_PC_REBUTTABLE_PRESUMPTION', 'SJ_01_PC_SEVERANCE', 'SJ_01_PC_SUBSTANTIATION_REQUIRED', 'SJ_01_PC_SUPPLEMENTAL_RETIREMENT', 'SJ_01_PC_WRITTEN_POLICY']


In [51]:
print(len(new_variables_df['variable_name_new'].tolist()))

26


In [52]:
set([c for c in df.columns.tolist() if c not in flat_list]) - set(new_variables_df['variable_name_new'].tolist())

{'URL'}

<br>The following block drops the 52 original columns, leaving *URL* plus the 26 new variables.

In [53]:
print(len(df.columns))
df = df[[c for c in df.columns.tolist() if c not in flat_list]]
print(len(df.columns))
df[:2]

79
27


Unnamed: 0,URL,SJ_01_PC_BOARD_APPROVAL,SJ_01_PC_CLUB_FEES,SJ_01_PC_COMPANION_TRAVEL,SJ_01_PC_COMPENSATION_COMMITTEE,SJ_01_PC_COMPENSATION_SURVEY,SJ_01_PC_CONSULTANT,SJ_01_PC_CONTINGENT_NET_OWN,SJ_01_PC_CONTINGENT_NET_RELATED,SJ_01_PC_CONTINGENT_REV_OWN,SJ_01_PC_CONTINGENT_REV_RELATED,SJ_01_PC_CONTRACT,SJ_01_PC_CONTRACT_EXCEPTION,SJ_01_PC_DISCRETIONARY_ACCOUNT,SJ_01_PC_EQUITY_BASED_COMP,SJ_01_PC_FIRST_CLASS_TRAVEL,SJ_01_PC_HOME_OFFICE_SUBSIDY,SJ_01_PC_HOUSING_ALLOWANCE,SJ_01_PC_INDEMNIFICATION,SJ_01_PC_NON_FIXED_PAYMENTS,SJ_01_PC_OTHER_ORGS_990,SJ_01_PC_PERSONAL_SERVICES,SJ_01_PC_REBUTTABLE_PRESUMPTION,SJ_01_PC_SEVERANCE,SJ_01_PC_SUBSTANTIATION_REQUIRED,SJ_01_PC_SUPPLEMENTAL_RETIREMENT,SJ_01_PC_WRITTEN_POLICY
0,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,1,,,1,1,,0,0,0,0,,0,,0,,,,,0,,,0,0,,1,
1,https://s3.amazonaws.com/irs-form-990/201113139349301316_public.xml,1,,,1,1,,0,0,0,0,,0,,0,,,,,0,,,0,0,,1,
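
The same result can be reached with `DataFrame.drop`. This is a sketch on a toy frame with made-up values, not a replacement for the cell above. One practical difference is that `drop` raises a `KeyError` by default if any name in the list is not a column, which doubles as a sanity check on `flat_list`.

```python
import pandas as pd

# Toy data (made up): one pair of original columns plus the combined column
toy = pd.DataFrame({
    'URL': ['a', 'b'],
    'ClubDuesOrFees': ['1', None],
    'ClubDuesOrFeesInd': [None, 'true'],
    'SJ_01_PC_CLUB_FEES': [1, 1],
})
flat_list = ['ClubDuesOrFees', 'ClubDuesOrFeesInd']

# Column selection, as in the cell above
kept_by_selection = toy[[c for c in toy.columns if c not in flat_list]]

# Explicit drop; fails loudly if a listed column does not exist
kept_by_drop = toy.drop(columns=flat_list)
print(kept_by_drop.columns.tolist())
```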


##### Verify

In [54]:
print(len(df.columns.tolist()))

27


In [55]:
set(df.columns.tolist()) - set(new_variables_df['variable_name_new'].tolist())

{'URL'}

In [56]:
set(new_variables_df['variable_name_new'].tolist()) - set(df.columns.tolist())

set()
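
The two set differences above are checked by eye. A minimal sketch of the same verification with assertions, so that a rerun fails immediately if the columns drift, could look like this. The toy frame and the shortened `new_names` list are stand-ins for `df` and `new_variables_df['variable_name_new']`.

```python
import pandas as pd

# Empty toy frame: only the column names matter for this check
toy = pd.DataFrame(columns=['URL', 'SJ_01_PC_BOARD_APPROVAL', 'SJ_01_PC_CLUB_FEES'])
new_names = ['SJ_01_PC_BOARD_APPROVAL', 'SJ_01_PC_CLUB_FEES']

# Columns in the frame that are not new variables (expected: only URL)
extra = set(toy.columns) - set(new_names)
# New variables that are missing from the frame (expected: none)
missing = set(new_names) - set(toy.columns)

assert extra == {'URL'}, f'Unexpected columns: {extra}'
assert not missing, f'Missing new variables: {missing}'
```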

In [57]:
df.dtypes

URL                                 object
SJ_01_PC_BOARD_APPROVAL             object
SJ_01_PC_CLUB_FEES                  object
SJ_01_PC_COMPANION_TRAVEL           object
SJ_01_PC_COMPENSATION_COMMITTEE     object
SJ_01_PC_COMPENSATION_SURVEY        object
SJ_01_PC_CONSULTANT                 object
SJ_01_PC_CONTINGENT_NET_OWN         object
SJ_01_PC_CONTINGENT_NET_RELATED     object
SJ_01_PC_CONTINGENT_REV_OWN         object
SJ_01_PC_CONTINGENT_REV_RELATED     object
SJ_01_PC_CONTRACT                   object
SJ_01_PC_CONTRACT_EXCEPTION         object
SJ_01_PC_DISCRETIONARY_ACCOUNT      object
SJ_01_PC_EQUITY_BASED_COMP          object
SJ_01_PC_FIRST_CLASS_TRAVEL         object
SJ_01_PC_HOME_OFFICE_SUBSIDY        object
SJ_01_PC_HOUSING_ALLOWANCE          object
SJ_01_PC_INDEMNIFICATION            object
SJ_01_PC_NON_FIXED_PAYMENTS         object
SJ_01_PC_OTHER_ORGS_990             object
SJ_01_PC_PERSONAL_SERVICES          object
SJ_01_PC_REBUTTABLE_PRESUMPTION     object
SJ_01_PC_SE

In [58]:
# Preview the first two new variable names
new_variables_df['variable_name_new'].tolist()[:2]

['SJ_01_PC_BOARD_APPROVAL', 'SJ_01_PC_CLUB_FEES']

In [59]:
df[new_variables_df['variable_name_new'].tolist()] = df[new_variables_df['variable_name_new'].tolist()].apply(pd.to_numeric, errors='coerce')
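
Because `errors='coerce'` silently turns anything non-numeric into `NaN`, it can be worth comparing non-null counts before and after the conversion. The sketch below does this on a toy frame. The values are made up, and the stray `'false'` string is an assumption used only to show what a loss would look like.

```python
import pandas as pd

# Toy data (made up): 'false' is a non-numeric string that coerce turns into NaN
toy = pd.DataFrame({
    'SJ_01_PC_SEVERANCE': ['0', '1', 'false', None],
    'SJ_01_PC_CLUB_FEES': ['1', None, None, '1'],
})
cols = toy.columns.tolist()

before = toy[cols].notna().sum()
converted = toy[cols].apply(pd.to_numeric, errors='coerce')
after = converted.notna().sum()

# Number of values per column that became NaN during conversion
lost = before - after
print(lost[lost > 0])
```

A column with a nonzero entry in `lost` still contains string codes that need to be recoded before the conversion.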

In [60]:
df.dtypes

URL                                  object
SJ_01_PC_BOARD_APPROVAL             float64
SJ_01_PC_CLUB_FEES                  float64
SJ_01_PC_COMPANION_TRAVEL           float64
SJ_01_PC_COMPENSATION_COMMITTEE     float64
SJ_01_PC_COMPENSATION_SURVEY        float64
SJ_01_PC_CONSULTANT                 float64
SJ_01_PC_CONTINGENT_NET_OWN         float64
SJ_01_PC_CONTINGENT_NET_RELATED     float64
SJ_01_PC_CONTINGENT_REV_OWN         float64
SJ_01_PC_CONTINGENT_REV_RELATED     float64
SJ_01_PC_CONTRACT                   float64
SJ_01_PC_CONTRACT_EXCEPTION         float64
SJ_01_PC_DISCRETIONARY_ACCOUNT      float64
SJ_01_PC_EQUITY_BASED_COMP          float64
SJ_01_PC_FIRST_CLASS_TRAVEL         float64
SJ_01_PC_HOME_OFFICE_SUBSIDY        float64
SJ_01_PC_HOUSING_ALLOWANCE          float64
SJ_01_PC_INDEMNIFICATION            float64
SJ_01_PC_NON_FIXED_PAYMENTS         float64
SJ_01_PC_OTHER_ORGS_990             float64
SJ_01_PC_PERSONAL_SERVICES          float64
SJ_01_PC_REBUTTABLE_PRESUMPTION 

In [61]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
SJ_01_PC_BOARD_APPROVAL,398308.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SJ_01_PC_CLUB_FEES,39872.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SJ_01_PC_COMPANION_TRAVEL,28539.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SJ_01_PC_COMPENSATION_COMMITTEE,198519.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SJ_01_PC_COMPENSATION_SURVEY,254666.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SJ_01_PC_CONSULTANT,98304.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
SJ_01_PC_CONTINGENT_NET_OWN,549972.0,0.02,0.14,0.0,0.0,0.0,0.0,1.0
SJ_01_PC_CONTINGENT_NET_RELATED,549903.0,0.01,0.11,0.0,0.0,0.0,0.0,1.0
SJ_01_PC_CONTINGENT_REV_OWN,550010.0,0.02,0.13,0.0,0.0,0.0,0.0,1.0
SJ_01_PC_CONTINGENT_REV_RELATED,549811.0,0.01,0.07,0.0,0.0,0.0,0.0,1.0
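
In the summary above, the first six variables have a minimum and maximum of 1, so they are either 1 or missing, while the contingent-compensation variables take both 0 and 1. A quick way to confirm that every column holds only 0, 1, or missing values is sketched below on a toy frame. The values and `bad_column` are made up to show a failing case.

```python
import numpy as np
import pandas as pd

# Toy data (made up); bad_column contains a 2 and should fail the check
toy = pd.DataFrame({
    'SJ_01_PC_CLUB_FEES': [1.0, np.nan, 1.0],
    'SJ_01_PC_CONTINGENT_NET_OWN': [0.0, 1.0, 0.0],
    'bad_column': [0.0, 2.0, np.nan],
})

# For each column: are all non-missing values either 0 or 1?
is_binary = toy.apply(lambda s: s.dropna().isin([0, 1]).all())
print(is_binary)
```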


##### Save DF

In [62]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('Schedule J (Part I) - parsed.pkl.gz', compression='gzip')

Current date and time :  2025-06-20 22:46:28 

CPU times: total: 24 s
Wall time: 24.6 s
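
As an optional final check, the saved file can be read back and compared with the frame in memory. The sketch below does a round trip on a small toy frame in a temporary directory; the file name and values are placeholders, not the actual output of this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy data (made up), including a missing value
toy = pd.DataFrame({'URL': ['a', 'b'], 'SJ_01_PC_CLUB_FEES': [1.0, np.nan]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy - parsed.pkl.gz')
    toy.to_pickle(path, compression='gzip')
    reloaded = pd.read_pickle(path, compression='gzip')

# DataFrame.equals treats NaNs in the same position as equal
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```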
