# Overview

Based on *IRS 990 e-File Data -- Excise Tax Project (1) -- Read in Schedule J from MongoDB, Flatten Schedule J, and Export Parts I-III.ipynb*

In this notebook I read in Schedule J from MongoDB, flatten the Schedule J columns, and export Parts I-III as three separate files.

Note that I create a combined file out of the AWS-based filings in ``filings_990`` and the new IRS-based filings in ``filings_990_y``.

Saved files:
- *Schedule J (non-flattened).pkl.gz*
- *Schedule J (flattened).pkl.gz*
- *Schedule J (Part II).pkl.gz*
- *Schedule J (Part III).pkl.gz*
- *Schedule J (Part I).pkl.gz*

Note that in the original notebook, *IRS 990 e-File Data -- Excise Tax Project (1) -- Read in Schedule J from MongoDB, Flatten Schedule J, and Export Parts I-III.ipynb*, I also read in a Schedule J concordance file, created a *MongoDB_Name* column for the new variables, and then saved the following version:
- *concordance - Schedule J.xlsx*

I then opened the file in Excel and created separate versions for Part I, Part II, and Part III. I tested these files in the follow-up notebooks, finalized the concordance files, and saved new Excel and pickled versions.

*Note*:
- When flattening *Schedule J* using *json_normalize*, be sure to set ``max_level=0``. Otherwise the result depends on how many people are listed in Part II. If there are multiple people (and thus a *list*), the data stays in the *Form990ScheduleJPartII* or *RltdOrgOfficerTrstKeyEmplGrp* column. If there is only one person (and thus a *dictionary*), *json_normalize* expands it into separate columns such as *Form990ScheduleJPartII.TotalCompensationFilingOrg* and *Form990ScheduleJPartII.TotalCompensationRelatedOrgs*.
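
The list-versus-dictionary behavior described in this note can be reproduced on two made-up records. This is a minimal sketch: the records and their values are invented for illustration, and only the key names are borrowed from the filings.

```python
import pandas as pd

# Two toy Schedule J records (values are made up): the first lists two
# people in Part II (a list), the second lists one person (a dict).
records = [
    {"SeverancePayment": "false",
     "Form990ScheduleJPartII": [{"NamePerson": "A"}, {"NamePerson": "B"}]},
    {"SeverancePayment": "true",
     "Form990ScheduleJPartII": {"NamePerson": "C"}},
]

# Default behavior: nested dicts are expanded, lists are left alone,
# so the single-person record spawns a dotted column.
deep = pd.json_normalize(records)

# max_level=0: only top-level keys become columns, whatever their type.
flat = pd.json_normalize(records, max_level=0)

print(sorted(deep.columns))
print(sorted(flat.columns))
```

With ``max_level=0`` every filing keeps its Part II data in one column, so lists and dictionaries can be handled together in a later step.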

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

2.2.2


In [3]:
from platform import python_version
print(python_version())

3.10.11


In [4]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

In [5]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [6]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

#### Set working directory

In [7]:
#cd '/Users/gsaxton/Dropbox/990 e-file data'

In [8]:
pwd

'C:\\Users\\Gregory\\Jupyter_Notebooks'

In [9]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Load Packages and Connect to MongoDB

In [10]:
import sys
import time
import json

In [14]:
import pymongo
from pymongo import MongoClient
client = MongoClient()

In [15]:
print(pymongo.__version__)

4.3.3


In [16]:
client.list_database_names()

['ICIJ',
 'OWS',
 'SMC',
 'admin',
 'cashtags',
 'config',
 'enron',
 'irs_990_db',
 'irs_990_db_v2',
 'local',
 'paradisepapers',
 'sec',
 'sp1500',
 'sp500']

In [17]:
# DEFINE MY mongoDB DATABASE
db = client['irs_990_db']

# DEFINE MY COLLECTION HOUSING 990 DATA
filings_990 = db['filings_990']

In [18]:
#dfj[dfj['URL']=='https://s3.amazonaws.com/irs-form-990/201423149349302877_public.xml']

In [19]:
db.filings_990.find_one({'URL' : "https://s3.amazonaws.com/irs-form-990/201423149349302877_public.xml" })

{'_id': ObjectId('5d04714578ffca27b430a8a1'),
 'OrganizationName': 'COLUMBUS ELECTRIC COOPERATIVE INC',
 'ObjectId': '201423149349302877',
 'URL': 'https://s3.amazonaws.com/irs-form-990/201423149349302877_public.xml',
 'SubmittedOn': '2014-12-04',
 'DLN': '93493314028774',
 'LastUpdated': '2016-03-21T17:23:53',
 'TaxPeriod': '201312',
 'FormType': '990',
 'EIN': '850094212',
 '@xmlns': 'http://www.irs.gov/efile',
 '@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 '@xsi:schemaLocation': 'http://www.irs.gov/efile',
 '@returnVersion': '2013v3.0',
 'ReturnHeader': {'@binaryAttachmentCnt': '0',
  'ReturnTs': '2014-11-10T17:15:36-06:00',
  'TaxPeriodEndDt': '2013-12-31',
  'PreparerFirmGrp': {'PreparerFirmEIN': '750882037',
   'PreparerFirmName': {'BusinessNameLine1': 'BOLINGER SEGARS GILBERT AND MOSS LLP'},
   'PreparerUSAddress': {'AddressLine1': '8215 NASHVILLE AVENUE',
    'City': 'LUBBOCK',
    'State': 'TX',
    'ZIPCode': '79423'}},
  'ReturnTypeCd': '990',
  'TaxPeriodBegin

In [20]:
db.filings_990.find_one({'URL' : "https://s3.amazonaws.com/irs-form-990/201401349349308135_public.xml" })

{'_id': ObjectId('5d047fcb78ffca27b430d4be'),
 'OrganizationName': 'HARVEY L MILLER SUPPORTING FOUNDATION',
 'ObjectId': '201401349349308135',
 'URL': 'https://s3.amazonaws.com/irs-form-990/201401349349308135_public.xml',
 'SubmittedOn': '2014-06-26',
 'DLN': '93493134081354',
 'LastUpdated': '2016-03-21T17:23:53',
 'TaxPeriod': '201306',
 'FormType': '990',
 'EIN': '900187252',
 '@xmlns': 'http://www.irs.gov/efile',
 '@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 '@xsi:schemaLocation': 'http://www.irs.gov/efile',
 '@returnVersion': '2012v2.1',
 'ReturnHeader': {'@binaryAttachmentCount': '0',
  'Timestamp': '2014-05-14T19:16:02-05:00',
  'TaxPeriodEndDate': '2013-06-30',
  'PreparerFirm': {'EIN': '420714325',
   'PreparerFirmBusinessName': {'BusinessNameLine1': 'MCGLADREY LLP'},
   'PreparerFirmUSAddress': {'AddressLine1': '1 S WACKER DRIVE STE 800',
    'City': 'CHICAGO',
    'State': 'IL',
    'ZIPCode': '60606'}},
  'ReturnType': '990',
  'TaxPeriodBeginDate': '2012-07-

In [21]:
db.filings_990.find_one({'URL' : "https://s3.amazonaws.com/irs-form-990/201100289349300910_public.xml" })

{'_id': ObjectId('5d01cfed78ffca27b428aa97'),
 'OrganizationName': 'ASSEMBLEIA DE DEUS MINISTERIO BELEM CHUR',
 'ObjectId': '201100289349300910',
 'URL': 'https://s3.amazonaws.com/irs-form-990/201100289349300910_public.xml',
 'SubmittedOn': '2011-09-22',
 'DLN': '93493028009101',
 'LastUpdated': '2016-03-21T17:23:53',
 'TaxPeriod': '201012',
 'FormType': '990',
 'EIN': '954745380',
 '@xmlns': 'http://www.irs.gov/efile',
 '@returnVersion': '2010v3.2',
 'ReturnHeader': {'@binaryAttachmentCount': '0',
  'Timestamp': '2011-01-28T13:07:07-08:00',
  'TaxPeriodEndDate': '2010-12-31',
  'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'VIRULAS GENERAL OFFICE'},
   'PreparerFirmUSAddress': {'AddressLine1': '4138 ATLANTIC AVE',
    'City': 'Long Beach',
    'State': 'CA',
    'ZIPCode': '90807'}},
  'ReturnType': '990',
  'TaxPeriodBeginDate': '2010-01-01',
  'Filer': {'EIN': '954745380',
   'Name': {'BusinessNameLine1': 'ASSEMBLEIA DE DEUS MINISTERIO BELEM CHUR'},
   'NameCont

In [22]:
for f in filings_990.find({})[2:3]:
    print(sorted(f['IRS990ScheduleJ']))

['@documentId', '@softwareId', '@softwareVersion', 'AnyNonFixedPayments', 'BoardOrCommitteeApproval', 'CompBasedNetEarningsFilingOrg', 'CompBasedNetEarningsRelateOrgs', 'CompBasedOnRevenueOfFilingOrg', 'CompBasedOnRevenueRelatedOrgs', 'CompensationCommittee', 'CompensationSurvey', 'EquityBasedCompArrangement', 'Form990ScheduleJPartII', 'InitialContractException', 'RebuttablePresumptionProcedure', 'SeverancePayment', 'SupplementalNonqualRetirePlan']


In [23]:
for f in filings_990.find({})[:1]:
    print(sorted(f.keys()))

['@documentCount', '@documentId', '@referenceDocumentId', '@returnVersion', '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', 'AccountantCompileOrReview', 'AccountsPayableAccruedExpenses', 'AccountsReceivable', 'ActivitiesConductedPartnership', 'ActivityOrMissionDescription', 'AddressChange', 'AddressPrincipalOfficerUS', 'AllOtherContributions', 'AllOtherExpenses', 'AnnualDisclosureCoveredPersons', 'AuditCommittee', 'BenefitsPaidToMembersCY', 'BenefitsPaidToMembersPriorYear', 'BsnssRltnshpThruFamilyMember', 'BsnssRltnshpWithOrganization', 'ChangesToOrganizingDocs', 'CollectionsOfArt', 'CompensationFromOtherSources', 'CompensationProcessCEO', 'CompensationProcessOther', 'ComplianceWithBackupWitholding', 'ConflictOfInterestPolicy', 'ConservationEasements', 'ConsolidatedAuditFinancialStmt', 'ContributionsGrantsCurrentYear', 'ContributionsGrantsPriorYear', 'CreditCounseling', 'DLN', 'DecisionsSubjectToApproval', 'DeductibleContributionsOfArt', 'DeductibleNonCashContributions', 'DelegationOfMa

In [24]:
list(db.filings_990.index_information())

['_id_', 'URL_1']

In [25]:
filings_990.estimated_document_count()

3469008

In [26]:
df = pd.DataFrame(list(filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1, 
    'TaxPeriod': 1, 
    'IRS990ScheduleJ': 1})[:2]))
df

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,IRS990ScheduleJ
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': ..."


In [27]:
def iterator2dataframe(iterator, chunk_size: int):
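    # Build one DataFrame from an iterator (e.g., a MongoDB cursor) by converting
    # chunk_size records at a time; this is a balance between memory and efficiency.
    # Note: each chunk keeps its own 0-based index, so index labels repeat in the result.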
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames, sort=False) if frames else pd.DataFrame()
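
Because each chunk is built as its own DataFrame, the concatenated result carries repeating index labels, which is why sampled rows later in the notebook show labels such as 620 and 894 rather than positions in the full 3.4-million-row frame. The sketch below illustrates this on a toy generator. `chunked_frame` is a hypothetical variant written for this illustration, not the function used above, and the `ignore_index` switch is an assumption about one way to obtain a unique index.

```python
import pandas as pd

def chunked_frame(iterator, chunk_size, ignore_index=False):
    # Hypothetical variant of iterator2dataframe, for illustration only.
    records, frames = [], []
    for record in iterator:
        records.append(record)
        if len(records) == chunk_size:
            frames.append(pd.DataFrame(records))
            records = []
    if records:  # leftover records that did not fill a whole chunk
        frames.append(pd.DataFrame(records))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, sort=False, ignore_index=ignore_index)

# A generator stands in for the MongoDB cursor.
repeating = chunked_frame(({"URL": f"u{i}"} for i in range(5)), 2)
unique = chunked_frame(({"URL": f"u{i}"} for i in range(5)), 2, ignore_index=True)

print(list(repeating.index))
print(list(unique.index))
```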

##### Test for Schedule I

In [28]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfi = iterator2dataframe(filings_990.find({}, {'_id': 0, 'URL': 1,
    'IRS990ScheduleI': 1}), 1000)
print("Number of columns:", len(dfi.columns))
print("Number of observations:", len(dfi))
dfi[:2] 

Current date and time :  2025-06-20 16:26:55 

Number of columns: 2
Number of observations: 3469008
CPU times: total: 1min 27s
Wall time: 5min 46s


Unnamed: 0,URL,IRS990ScheduleI
0,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,"{'@documentId': 'RetDoc1041900001', 'RecordsMaintained': '1', 'RecipientTable': [{'RecipientNameBusiness': {'BusinessNameLine1': 'RMH - WILMINGTON DE'}, 'AddressUS': {'AddressLine1': '1901 ROCKLAND ROAD', 'City': 'WILMINGTON', 'State': 'DE', 'ZIP..."
1,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,


In [29]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
print(len(dfi[dfi['IRS990ScheduleI'].isnull()]))
print(len(dfi[dfi['IRS990ScheduleI'].notnull()]))

Current date and time :  2025-06-20 16:34:49 

2687713
781295
CPU times: total: 828 ms
Wall time: 881 ms


#### Schedule J

In [30]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = iterator2dataframe(filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,
    'IRS990ScheduleJ': 1}), 1000)
print("Number of columns:", len(df.columns))
print("Number of observations:", len(df))
df[:2]  

Current date and time :  2025-06-20 16:35:00 

Number of columns: 5
Number of observations: 3469008
CPU times: total: 48.5 s
Wall time: 2min 17s


Unnamed: 0,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,232705170,
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,581805618,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': ..."


In [31]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
print(len(df[df['IRS990ScheduleJ'].isnull()]))
print(len(df[df['IRS990ScheduleJ'].notnull()]))

Current date and time :  2025-06-20 16:41:07 

2695894
773114
CPU times: total: 1.73 s
Wall time: 1.87 s
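
The two counts above can also be taken from a single boolean mask instead of two filtered copies of the frame. A minimal sketch on made-up rows (the toy values are assumptions; only the column name comes from the notebook):

```python
import pandas as pd

# Toy stand-in for df: two filings with a Schedule J, two without.
toy = pd.DataFrame({
    "URL": ["u1", "u2", "u3", "u4"],
    "IRS990ScheduleJ": [None, {"SeverancePaymentInd": "0"},
                        None, {"SeverancePaymentInd": "1"}],
})

has_schedule = toy["IRS990ScheduleJ"].notna()  # True where a Schedule J exists
n_with = int(has_schedule.sum())
n_without = int((~has_schedule).sum())
print(n_without, n_with)
```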


In [32]:
df[df['IRS990ScheduleJ'].isnull()].sample(10)

Unnamed: 0,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
620,,https://s3.amazonaws.com/irs-form-990/202243199349302839_public.xml,,,
894,AMERICAN LEGION JOE HOOPER POST 209,https://s3.amazonaws.com/irs-form-990/201521359349306062_public.xml,93493135060625.0,916075472.0,
691,,https://s3.amazonaws.com/irs-form-990/202311749349300306_public.xml,,,
269,,https://s3.amazonaws.com/irs-form-990/202231789349301013_public.xml,,,
85,CAUSE FOR PAWZ INC,https://s3.amazonaws.com/irs-form-990/201321939349301252_public.xml,93493193012523.0,273028992.0,
540,CAROLINA SHAG CLUB INC,https://s3.amazonaws.com/irs-form-990/201710719349300151_public.xml,93493071001517.0,570970084.0,
868,MOBILE MEDICAL DISASTER RELIEF INC,https://s3.amazonaws.com/irs-form-990/201323169349304142_public.xml,93493316041423.0,300345964.0,
318,SEAL LEGACY FOUNDATION INC,https://s3.amazonaws.com/irs-form-990/201543209349313729_public.xml,93493320137295.0,453117712.0,
636,,https://s3.amazonaws.com/irs-form-990/202123099349301497_public.xml,,,
633,NATIONAL REGISTRY OF REHABILITATION TECHNOLOGY SUPPLIERS INC,https://s3.amazonaws.com/irs-form-990/201841349349304169_public.xml,93493134041698.0,541648579.0,


In [33]:
df[df['IRS990ScheduleJ'].notnull()].sample(5)

Unnamed: 0,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
558,,https://s3.amazonaws.com/irs-form-990/202341099349300129_public.xml,,,"{'@documentId': 'RetDoc1042400001', 'CompensationCommitteeInd': 'X', 'IndependentConsultantInd': 'X', 'Form990OfOtherOrganizationsInd': 'X', 'WrittenEmploymentContractInd': 'X', 'CompensationSurveyInd': 'X', 'BoardOrCommitteeApprovalInd': 'X', 'S..."
503,FAMILY SERVICES INC,https://s3.amazonaws.com/irs-form-990/201501909349300615_public.xml,93493190006155.0,930991864.0,"{'@documentId': '00000005', 'SeverancePaymentInd': 'false', 'SupplementalNonqualRtrPlanInd': 'false', 'EquityBasedCompArrngmInd': 'false', 'CompBasedOnRevenueOfFlngOrgInd': 'false', 'CompBsdOnRevRelatedOrgsInd': 'false', 'CompBsdNetEarnsFlngOrgIn..."
691,,https://s3.amazonaws.com/irs-form-990/202243349349300739_public.xml,,,"{'@documentId': 'RetDoc1042400001', 'BoardOrCommitteeApprovalInd': 'X', 'SeverancePaymentInd': '0', 'SupplementalNonqualRtrPlanInd': '0', 'EquityBasedCompArrngmInd': '0', 'CompBasedOnRevenueOfFlngOrgInd': '0', 'CompBsdOnRevRelatedOrgsInd': '0', '..."
776,,https://s3.amazonaws.com/irs-form-990/202212999349301731_public.xml,,,"{'@documentId': 'RetDoc1042400001', 'SeverancePaymentInd': '0', 'SupplementalNonqualRtrPlanInd': '0', 'EquityBasedCompArrngmInd': '0', 'CompBasedOnRevenueOfFlngOrgInd': '0', 'CompBsdOnRevRelatedOrgsInd': '0', 'CompBsdNetEarnsFlngOrgInd': '0', 'Co..."
708,GLOBAL OUTREACH INTERNATIONAL INC,https://s3.amazonaws.com/irs-form-990/202033149349303788_public.xml,93493314037880.0,481256219.0,"{'@documentId': 'RetDoc1042400001', 'Form990OfOtherOrganizationsInd': 'X', 'CompensationSurveyInd': 'X', 'BoardOrCommitteeApprovalInd': 'X', 'SeverancePaymentInd': '0', 'SupplementalNonqualRtrPlanInd': '0', 'EquityBasedCompArrngmInd': '0', 'CompB..."


#### Limit DF to non-null Schedule J

In [34]:
dfj = df[df['IRS990ScheduleJ'].notnull()]
print(len(dfj))
dfj.sample(2)

773114


Unnamed: 0,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
873,IBEW 654 H AND W FUND TRUSTEES,https://s3.amazonaws.com/irs-form-990/201912749349300601_public.xml,93493274006019,231613860,"{'@documentId': 'RetDoc1042400001', 'SeverancePaymentInd': '0', 'SupplementalNonqualRtrPlanInd': '0', 'EquityBasedCompArrngmInd': '0', 'RltdOrgOfficerTrstKeyEmplGrp': [{'PersonNm': 'WILLIAM ADAMS', 'TitleTxt': 'UNION TRUSTEE', 'BaseCompensationFi..."
646,SNAME PROPERTIES INC,https://s3.amazonaws.com/irs-form-990/201403169349303545_public.xml,93493316035454,222807776,"{'@documentId': 'R000003', '@softwareId': '13000241', '@softwareVersion': 'v1.00', 'CompensationCommitteeInd': 'X', 'SeverancePaymentInd': '0', 'SupplementalNonqualRtrPlanInd': '0', 'EquityBasedCompArrngmInd': '0', 'RltdOrgOfficerTrstKeyEmplGrp':..."


#### Save DF

In [35]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfj.to_pickle('Schedule J (non-flattened).pkl.gz', compression='gzip')

Current date and time :  2025-06-20 16:41:58 

CPU times: total: 2min 8s
Wall time: 2min 12s
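
As a quick check of the save step, a pickled frame can be read back with `pd.read_pickle`, which infers gzip compression from the `.gz` extension by default. The sketch below uses a temporary directory and a one-row toy frame, both of which are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# One-row toy frame with a dict-valued column, like IRS990ScheduleJ.
toy = pd.DataFrame({"URL": ["u1"],
                    "IRS990ScheduleJ": [{"SeverancePaymentInd": "0"}]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl.gz")
    toy.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path)  # compression inferred from ".gz"
    with open(path, "rb") as fh:
        magic = fh.read(2)  # first two bytes of the file on disk

print(restored.loc[0, "IRS990ScheduleJ"])
```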


In [131]:
#%%time
#import datetime
#print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#dfj = pd.read_pickle('Excise Tax Project - Schedule J (non-flattened).pkl')
#print(len(dfj))
#dfj[:1]

430161
Wall time: 18.3 s


Unnamed: 0,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,581805618,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': 'true', 'EquityBasedCompArrangement': 'false', 'CompBasedOnRevenueOfFilingOrg': 'false', 'CompBasedOnRevenueRelatedOrgs': 'false', 'CompBasedNetEarningsFilingOrg': 'false', 'CompBasedNetEarningsRelateOrgs': 'false', 'AnyNonFixedPayments': 'false', 'I..."


# Flatten Schedule J

In [36]:
pd.set_option('max_colwidth', 500)

In [38]:
df[df['IRS990ScheduleJ'].notnull()][['URL', 'IRS990ScheduleJ']][:1]

Unnamed: 0,URL,IRS990ScheduleJ
1,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': 'true', 'EquityBasedCompArrangement': 'false', 'CompBasedOnRevenueOfFilingOrg': 'false', 'CompBasedOnRevenueRelatedOrgs': 'false', 'CompBasedNetEarningsFilingOrg': 'false', 'CompBasedNetEarningsRelateOrgs': 'false', 'AnyNonFixedPayments': 'false', 'I..."


In [159]:
#%%time
#dfj = pd.concat([dfj.drop(['IRS990ScheduleJ'], axis=1), dfj['IRS990ScheduleJ'].apply(pd.Series)], axis=1)
#dfj[:1]

### Process All Filings

In [None]:
# flatten the dict-valued Schedule J column into a DataFrame (top-level keys only)
j = pd.json_normalize(dfj["IRS990ScheduleJ"], max_level=0)

In [42]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#from pandas.io.json import json_normalize
j = pd.json_normalize(dfj['IRS990ScheduleJ'][:], max_level=0)
print(len(j.columns))
print(len(j))
j[:1]

Current date and time :  2025-06-20 16:47:50 

61
773114
CPU times: total: 1min 31s
Wall time: 1min 44s


Unnamed: 0,@documentId,@softwareId,@softwareVersion,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,Form990ScheduleJPartII,IndependentConsultant,WrittenEmploymentContract,Form990ScheduleJPartIII,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,PersonalServices,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,RltdOrgOfficerTrstKeyEmplGrp,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,SupplementalInformationDetail,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,PersonalServicesInd,PaymentsForUseOfResidenceInd,@documentName,@softwareVersionNum
0,IRS990ScheduleJ,10000105,2010v3.2,X,X,X,False,True,False,False,False,False,False,False,False,False,"[{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}, {'NamePerson': 'RONALD W PATTERSON', 'CompBasedOnRelatedOrgs': '192455', 'BonusRelatedOrgs': '814', 'OtherCompensationRelatedOrgs': '2071', 'DeferredCompRelatedOrgs': '17271', 'NontaxableBenefitsRelatedOrgs': '23201', 'TotalCompensatio...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
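
With `max_level=0`, the Part II column holds a list for filings with several people, a dictionary for filings with one person, and nothing when Part II is absent. One possible way to turn that column into one row per person is sketched below. The toy data, the `as_list` helper, and the `person` column name are all hypothetical, and this is not necessarily how the Part II file is built in the follow-up steps.

```python
import pandas as pd

toy = pd.DataFrame({
    "URL": ["u1", "u2", "u3"],
    "Form990ScheduleJPartII": [
        [{"NamePerson": "A", "TotalCompensationFilingOrg": "100"},
         {"NamePerson": "B", "TotalCompensationFilingOrg": "200"}],
        {"NamePerson": "C", "TotalCompensationFilingOrg": "300"},
        None,
    ],
})

def as_list(value):
    # Hypothetical helper: wrap a lone dict in a list, map missing values to [].
    if isinstance(value, list):
        return value
    if isinstance(value, dict):
        return [value]
    return []

# One row per person; filings without Part II are dropped.
people = (toy.assign(person=toy["Form990ScheduleJPartII"].map(as_list))
             .explode("person")
             .dropna(subset=["person"])
             .reset_index(drop=True))

# Expand each person's dict into columns and keep the filing URL alongside.
part2 = pd.concat([people[["URL"]],
                   pd.json_normalize(people["person"].tolist())], axis=1)
print(part2)
```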


In [43]:
jcols = j.columns.tolist()
jcols

['@documentId',
 '@softwareId',
 '@softwareVersion',
 'CompensationCommittee',
 'CompensationSurvey',
 'BoardOrCommitteeApproval',
 'SeverancePayment',
 'SupplementalNonqualRetirePlan',
 'EquityBasedCompArrangement',
 'CompBasedOnRevenueOfFilingOrg',
 'CompBasedOnRevenueRelatedOrgs',
 'CompBasedNetEarningsFilingOrg',
 'CompBasedNetEarningsRelateOrgs',
 'AnyNonFixedPayments',
 'InitialContractException',
 'RebuttablePresumptionProcedure',
 'Form990ScheduleJPartII',
 'IndependentConsultant',
 'WrittenEmploymentContract',
 'Form990ScheduleJPartIII',
 'HousingAllowanceOrResidence',
 'WrittenPolicyReTAndEExpenses',
 'SubstantiationRequired',
 'IdemnificationGrossUpPayments',
 'DiscretionarySpendingAccount',
 'ClubDuesOrFees',
 'FirstClassOrCharterTravel',
 'TravelForCompanions',
 'Form990OfOtherOrganizations',
 'PaymentsForUseOfResidence',
 'PersonalServices',
 'SeverancePaymentInd',
 'SupplementalNonqualRtrPlanInd',
 'EquityBasedCompArrngmInd',
 'CompBasedOnRevenueOfFlngOrgInd',
 'CompBsdO

In [44]:
set(j.columns) - set(jcols)

set()

In [45]:
set(jcols) - set(j.columns) 

set()
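
The column list mixes older and newer schema names for the same item, for example *SeverancePayment* and *SeverancePaymentInd*. The sketch below shows one way such a pair could be merged into a single column. The `pairs` mapping is a hypothetical stand-in for the concordance file, and the toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "SeverancePayment": ["false", None, None],    # older schema name
    "SeverancePaymentInd": [None, "0", "true"],   # newer schema name
})

# Hypothetical old -> new mapping; in practice this would come from the concordance.
pairs = {"SeverancePayment": "SeverancePaymentInd"}

for old, new in pairs.items():
    # Keep the newer column's value where present, else fall back to the older one.
    toy[new] = toy[new].combine_first(toy[old])

harmonised = toy.drop(columns=list(pairs))
print(harmonised)
```

The merged values still use different encodings ('false', '0', 'X'), so recoding them remains a separate step.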

In [18]:
#%%time
#from pandas.io.json import json_normalize
#j = pd.json_normalize(dfj['IRS990ScheduleJ'][:])
#print(len(j))
#j[:1]

430161
Wall time: 2.01 ms


Unnamed: 0,@documentId,@softwareId,@softwareVersion,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,Form990ScheduleJPartII,IndependentConsultant,WrittenEmploymentContract,Form990ScheduleJPartIII,Form990ScheduleJPartII.NamePerson,Form990ScheduleJPartII.BaseCompensationFilingOrg,Form990ScheduleJPartII.CompBasedOnRelatedOrgs,Form990ScheduleJPartII.BonusFilingOrg,Form990ScheduleJPartII.BonusRelatedOrgs,Form990ScheduleJPartII.OtherCompensationFilingOrg,Form990ScheduleJPartII.OtherCompensationRelatedOrgs,Form990ScheduleJPartII.DeferredCompFilingOrg,Form990ScheduleJPartII.DeferredCompRelatedOrgs,Form990ScheduleJPartII.NontaxableBenefitsFilingOrg,Form990ScheduleJPartII.NontaxableBenefitsRelatedOrgs,Form990ScheduleJPartII.TotalCompensationFilingOrg,Form990ScheduleJPartII.TotalCompensationRelatedOrgs,Form990ScheduleJPartII.CompReportPrior990FilingOrg,Form990ScheduleJPartII.CompReportPrior990RelatedOrgs,Form990ScheduleJPartIII.Identifier,Form990ScheduleJPartIII.ReturnReference,Form990ScheduleJPartIII.Explanation,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,Form990ScheduleJPartII.NameBusiness.BusinessNameLine1,PersonalServices,Form990ScheduleJPartII.NameBusiness.BusinessNameLine2,Form990ScheduleJPartII.Title,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,RltdOrgOfficerTrstKeyEmplGrp.PersonNm,RltdOrgOfficerTrstKeyEmplGrp.TitleTxt,RltdOrgOfficerTrstKeyEmplGrp.BaseCompensationFilingOrgAmt,RltdOrgOfficerTrstKeyEmplGrp.CompensationBasedOnRltdOrgsAmt,RltdOrgOfficerTrstKeyEmplGrp.BonusFilingOrganizationAmount,RltdOrgOfficerTrstKeyEmplGrp.BonusRelatedOrganizationsAmt,RltdOrgOfficerTrstKeyEmplGrp.OtherCompensationFilingOrgAmt,RltdOrgOfficerTrstKeyEmplGrp.OtherCompensationRltdOrgsAmt,RltdOrgOfficerTrstKeyEmplGrp.DeferredCompensationFlngOrgAmt,RltdOrgOfficerTrstKeyEmplGrp.DeferredCompRltdOrgsAmt,RltdOrgOfficerTrstKeyEmplGrp.NontaxableBenefitsFilingOrgAmt,RltdOrgOfficerTrstKeyEmplGrp.NontaxableBenefitsRltdOrgsAmt,RltdOrgOfficerTrstKeyEmplGrp.TotalCompensationFilingOrgAmt,RltdOrgOfficerTrstKeyEmplGrp.TotalCompensationRltdOrgsAmt,RltdOrgOfficerTrstKeyEmplGrp.CompReportPrior990FilingOrgAmt,RltdOrgOfficerTrstKeyEmplGrp.CompReportPrior990RltdOrgsAmt,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,RltdOrgOfficerTrstKeyEmplGrp,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,SupplementalInformationDetail.FormAndLineReferenceDesc,SupplementalInformationDetail.ExplanationTxt,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,SupplementalInformationDetail,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,RltdOrgOfficerTrstKeyEmplGrp.BusinessName.BusinessNameLine1,PersonalServicesInd,PaymentsForUseOfResidenceInd,RltdOrgOfficerTrstKeyEmplGrp.BusinessName.BusinessNameLine2,@documentName,@softwareVersionNum,RltdOrgOfficerTrstKeyEmplGrp.BusinessName.BusinessNameLine1Txt,RltdOrgOfficerTrstKeyEmplGrp.BusinessName.BusinessNameLine2Txt
0,IRS990ScheduleJ,10000105,2010v3.2,X,X,X,False,True,False,False,False,False,False,False,False,False,"[{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}, {'NamePerson': 'RONALD W PATTERSON', 'CompBasedOnRelatedOrgs': '192455', 'BonusRelatedOrgs': '814', 'OtherCompensationRelatedOrgs': '2071', 'DeferredCompRelatedOrgs': '17271', 'NontaxableBenefitsRelatedOrgs': '23201', 'TotalCompensatio...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
"""
#####NEXT TIME RUN THIS INSTEAD OF ABOVE BLOCK
import timeit
start_time = timeit.default_timer()
df = pd.concat([df.drop(['USER'], axis=1), df['USER'].apply(pd.Series).add_prefix('USER_')], axis=1)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60)
df[:1]
"""

In [47]:
dfj[:1]

Unnamed: 0,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,581805618,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': 'true', 'EquityBasedCompArrangement': 'false', 'CompBasedOnRevenueOfFilingOrg': 'false', 'CompBasedOnRevenueRelatedOrgs': 'false', 'CompBasedNetEarningsFilingOrg': 'false', 'CompBasedNetEarningsRelateOrgs': 'false', 'AnyNonFixedPayments': 'false', 'I..."


In [48]:
print(len(j[j['Form990ScheduleJPartIII'].isnull()]))
print(len(j[j['Form990ScheduleJPartIII'].notnull()]))

719801
53313


In [54]:
dfj = dfj.reset_index()  # replace the repeating chunk index with 0..n-1 so rows line up with j
dfj[:2]

Unnamed: 0,index,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
0,1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,581805618,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': 'true', 'EquityBasedCompArrangement': 'false', 'CompBasedOnRevenueOfFilingOrg': 'false', 'CompBasedOnRevenueRelatedOrgs': 'false', 'CompBasedNetEarningsFilingOrg': 'false', 'CompBasedNetEarningsRelateOrgs': 'false', 'AnyNonFixedPayments': 'false', 'I..."
1,2,HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR,https://s3.amazonaws.com/irs-form-990/201113139349301316_public.xml,93493313013161,581876019,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '10000105', '@softwareVersion': '2010v3.2', 'CompensationCommittee': 'X', 'CompensationSurvey': 'X', 'BoardOrCommitteeApproval': 'X', 'SeverancePayment': 'false', 'SupplementalNonqualRetirePlan': 'true', 'EquityBasedCompArrangement': 'false', 'CompBasedOnRevenueOfFilingOrg': 'false', 'CompBasedOnRevenueRelatedOrgs': 'false', 'CompBasedNetEarningsFilingOrg': 'false', 'CompBasedNetEarningsRelateOrgs': 'false', 'AnyNonFixedPayments': 'false', 'I..."


In [57]:
dfj[-2:]

Unnamed: 0,index,OrganizationName,URL,DLN,EIN,IRS990ScheduleJ
773112,988,,https://s3.amazonaws.com/irs-form-990/202441449349301409_public.xml,,,"{'@documentId': 'RetDoc6', 'CompensationCommitteeInd': 'X', 'IndependentConsultantInd': 'X', 'Form990OfOtherOrganizationsInd': 'X', 'WrittenEmploymentContractInd': 'X', 'CompensationSurveyInd': 'X', 'BoardOrCommitteeApprovalInd': 'X', 'SeverancePaymentInd': 'false', 'SupplementalNonqualRtrPlanInd': 'false', 'EquityBasedCompArrngmInd': 'false', 'CompBasedOnRevenueOfFlngOrgInd': 'false', 'CompBsdOnRevRelatedOrgsInd': 'false', 'CompBsdNetEarnsFlngOrgInd': 'false', 'CompBsdNetEarnsRltdOrgsInd': ..."
773113,2,,https://s3.amazonaws.com/irs-form-990/202441449349301564_public.xml,,,"{'@documentId': 'IRS990ScheduleJ', '@softwareId': '23017437', '@softwareVersionNum': '2023v5.0', 'CompensationCommitteeInd': 'X', 'CompensationSurveyInd': 'X', 'BoardOrCommitteeApprovalInd': 'X', 'SeverancePaymentInd': 'false', 'SupplementalNonqualRtrPlanInd': 'false', 'EquityBasedCompArrngmInd': 'false', 'CompBasedOnRevenueOfFlngOrgInd': 'false', 'CompBsdOnRevRelatedOrgsInd': 'false', 'CompBsdNetEarnsFlngOrgInd': 'false', 'CompBsdNetEarnsRltdOrgsInd': 'false', 'AnyNonFixedPaymentsInd': 'fal..."


In [60]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfj = pd.concat([dfj.drop(['index', 'IRS990ScheduleJ'], axis=1), j], axis=1)
print(len(df))  # length of the full, unfiltered frame df (not dfj); len(dfj) is printed in the next cell
dfj[:1]

Current date and time :  2025-06-20 16:55:34 

3469008
CPU times: total: 1.06 s
Wall time: 1.13 s


Unnamed: 0,OrganizationName,URL,DLN,EIN,@documentId,@softwareId,@softwareVersion,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,Form990ScheduleJPartII,IndependentConsultant,WrittenEmploymentContract,Form990ScheduleJPartIII,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,PersonalServices,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,RltdOrgOfficerTrstKeyEmplGrp,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,SupplementalInformationDetail,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,PersonalServicesInd,PaymentsForUseOfResidenceInd,@documentName,@softwareVersionNum
0,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,581805618,IRS990ScheduleJ,10000105,2010v3.2,X,X,X,False,True,False,False,False,False,False,False,False,False,"[{'NamePerson': 'THOMAS D TURNBULL', 'CompBasedOnRelatedOrgs': '100712', 'BonusRelatedOrgs': '790', 'OtherCompensationRelatedOrgs': '1257', 'DeferredCompRelatedOrgs': '54308', 'NontaxableBenefitsRelatedOrgs': '62342', 'TotalCompensationRelatedOrgs': '219409'}, {'NamePerson': 'RONALD W PATTERSON', 'CompBasedOnRelatedOrgs': '192455', 'BonusRelatedOrgs': '814', 'OtherCompensationRelatedOrgs': '2071', 'DeferredCompRelatedOrgs': '17271', 'NontaxableBenefitsRelatedOrgs': '23201', 'TotalCompensatio...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [62]:
print(len(dfj))

773114


In [63]:
dfj.sample(2)

Unnamed: 0,OrganizationName,URL,DLN,EIN,@documentId,@softwareId,@softwareVersion,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,Form990ScheduleJPartII,IndependentConsultant,WrittenEmploymentContract,Form990ScheduleJPartIII,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,PersonalServices,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,RltdOrgOfficerTrstKeyEmplGrp,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,SupplementalInformationDetail,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,PersonalServicesInd,PaymentsForUseOfResidenceInd,@documentName,@softwareVersionNum
635278,,https://s3.amazonaws.com/irs-form-990/202331309349302628_public.xml,,,IRS990ScheduleJ,21013475.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,false,false,false,false,false,false,false,false,false,"[{'PersonNm': 'BRIAN PERRY', 'TitleTxt': 'DIRECTOR', 'BaseCompensationFilingOrgAmt': '141154', 'DeferredCompensationFlngOrgAmt': '8648', 'NontaxableBenefitsFilingOrgAmt': '26513', 'TotalCompensationFilingOrgAmt': '176315'}, {'PersonNm': 'BRUCE D COLLIER', 'TitleTxt': 'President', 'BaseCompensationFilingOrgAmt': '221209', 'DeferredCompensationFlngOrgAmt': '13248', 'NontaxableBenefitsFilingOrgAmt': '30584', 'TotalCompensationFilingOrgAmt': '265041'}, {'PersonNm': 'DAVID THOMAS', 'TitleTxt': 'E...",,X,,,,,,,,,,,,,,,,,,2021v4.1
339028,HUDSON VALLEY SENIOR RESIDENCE,https://s3.amazonaws.com/irs-form-990/201803099349301565_public.xml,93493309015658.0,141364545.0,RetDoc1042400001,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,0,0,0,0,0,"[{'PersonNm': 'FRANK TRIPODI', 'TitleTxt': 'BOARD MEMBER & PRESIDENT', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '410951', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '52530', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '5400', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '27560', 'TotalCompensationFilingOrgAmt': '0...",,,,,,,,,,,,,,,,,,,,


In [64]:
print(dfj.columns.tolist())

['OrganizationName', 'URL', 'DLN', 'EIN', '@documentId', '@softwareId', '@softwareVersion', 'CompensationCommittee', 'CompensationSurvey', 'BoardOrCommitteeApproval', 'SeverancePayment', 'SupplementalNonqualRetirePlan', 'EquityBasedCompArrangement', 'CompBasedOnRevenueOfFilingOrg', 'CompBasedOnRevenueRelatedOrgs', 'CompBasedNetEarningsFilingOrg', 'CompBasedNetEarningsRelateOrgs', 'AnyNonFixedPayments', 'InitialContractException', 'RebuttablePresumptionProcedure', 'Form990ScheduleJPartII', 'IndependentConsultant', 'WrittenEmploymentContract', 'Form990ScheduleJPartIII', 'HousingAllowanceOrResidence', 'WrittenPolicyReTAndEExpenses', 'SubstantiationRequired', 'IdemnificationGrossUpPayments', 'DiscretionarySpendingAccount', 'ClubDuesOrFees', 'FirstClassOrCharterTravel', 'TravelForCompanions', 'Form990OfOtherOrganizations', 'PaymentsForUseOfResidence', 'PersonalServices', 'SeverancePaymentInd', 'SupplementalNonqualRtrPlanInd', 'EquityBasedCompArrngmInd', 'CompBasedOnRevenueOfFlngOrgInd', 'Co

#### Save DF

In [65]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfj.to_pickle('Schedule J (flattened).pkl.gz', compression='gzip')

Current date and time :  2025-06-20 16:56:19 

CPU times: total: 2min 12s
Wall time: 2min 17s
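
A note on the file format: the Part II and Part III columns hold nested Python objects (lists and dictionaries), and pickling keeps them intact when the file is read back in. Below is a minimal sketch of that round trip on a toy DataFrame. The `toy` frame, the *Part* column, and the temporary file name are made up for illustration; they are not part of the real data.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical values): one nested list cell and one missing cell
toy = pd.DataFrame({
    'URL': ['filing_a', 'filing_b'],
    'Part': [[{'PersonNm': 'PERSON ONE'}], None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl.gz')
    # Same call pattern as the save cell above
    toy.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

# The nested cell comes back as a Python list, not as a string
print(type(restored['Part'].iloc[0]))
```

A CSV export would turn these cells into strings, which would then have to be parsed again in the follow-up notebooks.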


# Read in Concordance File
We are going to read in two codebooks. The first is the 'concordance' file: before re-arranging and renaming variables, we read in the relevant section of the *master concordance* file and use it to identify the relevant 'compensation' variables. In a following notebook, we will use the *variable_name_new* field as our variable name.
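
To illustrate how the concordance might be used for renaming later on, here is a minimal sketch on toy data. It is an assumption about the follow-up step, not the code from the following notebook. The two concordance rows mirror the *ClubDuesOrFees* entries in the real file; `toy_filings` and its values are made up.

```python
import pandas as pd

# Toy concordance: two schema versions of the same item share one new name
toy_concordance = pd.DataFrame({
    'MongoDB_Name': ['ClubDuesOrFees', 'ClubDuesOrFeesInd'],
    'variable_name_new': ['SJ_01_PC_CLUB_FEES', 'SJ_01_PC_CLUB_FEES'],
})

# Toy filings (made-up values): each row uses only one of the two versions
toy_filings = pd.DataFrame({
    'ClubDuesOrFees': ['X', None],
    'ClubDuesOrFeesInd': [None, 'false'],
})

# Old key -> harmonized variable name
name_map = dict(zip(toy_concordance['MongoDB_Name'],
                    toy_concordance['variable_name_new']))

# Several old keys can map to one new name, so coalesce the columns
# instead of doing a plain rename (which would create duplicate columns)
harmonized = {}
for old_name, new_name in name_map.items():
    if old_name not in toy_filings.columns:
        continue
    if new_name in harmonized:
        harmonized[new_name] = harmonized[new_name].combine_first(toy_filings[old_name])
    else:
        harmonized[new_name] = toy_filings[old_name]

harmonized_df = pd.DataFrame(harmonized)
print(harmonized_df)
```

The coalescing step matters because the old and new schema versions never appear in the same filing, so each harmonized column should end up with at most one non-missing source value per row.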

In [66]:
concordance = pd.read_excel('concordance - Schedule J.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 15
# of observations: 105


Unnamed: 0.1,Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,MongoDB_Name,sub_key,MongoDB_Name2,MongoDB_Name3
0,0,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFees,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,ClubDuesOrFees,,ClubDuesOrFees,IRS990ScheduleJ
1,1,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFeesInd,SJ_01_PC_CLUB_FEES,,,,,Club dues or fees,SCHED-J-PART-01-LINE-1a,PART-01,CheckboxType,ClubDuesOrFeesInd,,ClubDuesOrFeesInd,IRS990ScheduleJ


In [67]:
# Take the last xpath segment as the key name used in MongoDB
concordance['MongoDB_Name2'] = concordance['xpath'].str.split('/')
concordance['MongoDB_Name2'] = concordance['MongoDB_Name2'].apply(lambda x: x[-1])
concordance[['xpath', 'MongoDB_Name', 'MongoDB_Name2', 'variable_name_new']][:2]

Unnamed: 0,xpath,MongoDB_Name,MongoDB_Name2,variable_name_new
0,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFees,ClubDuesOrFees,ClubDuesOrFees,SJ_01_PC_CLUB_FEES
1,/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFeesInd,ClubDuesOrFeesInd,ClubDuesOrFeesInd,SJ_01_PC_CLUB_FEES


In [68]:
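# Sanity check: rows where the existing MongoDB_Name differs from the xpath-derived name (none expected)
concordance[(concordance['MongoDB_Name'].notnull())&(concordance['MongoDB_Name2']!=concordance['MongoDB_Name'])]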

Unnamed: 0.1,Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,MongoDB_Name,sub_key,MongoDB_Name2,MongoDB_Name3


In [69]:
part2_cols = set(concordance[concordance['part']=='PART-02']['MongoDB_Name2'].tolist())
print(len(part2_cols))
part2_cols

36


{'BaseCompensationFilingOrg',
 'BaseCompensationFilingOrgAmt',
 'BonusFilingOrg',
 'BonusFilingOrganizationAmount',
 'BonusRelatedOrganizationsAmt',
 'BonusRelatedOrgs',
 'BusinessNameLine1',
 'BusinessNameLine1Txt',
 'BusinessNameLine2',
 'BusinessNameLine2Txt',
 'CompBasedOnRelatedOrgs',
 'CompReportPrior990FilingOrg',
 'CompReportPrior990FilingOrgAmt',
 'CompReportPrior990RelatedOrgs',
 'CompReportPrior990RltdOrgsAmt',
 'CompensationBasedOnRltdOrgsAmt',
 'DeferredCompFilingOrg',
 'DeferredCompRelatedOrgs',
 'DeferredCompRltdOrgsAmt',
 'DeferredCompensationFlngOrgAmt',
 'NamePerson',
 'NontaxableBenefitsFilingOrg',
 'NontaxableBenefitsFilingOrgAmt',
 'NontaxableBenefitsRelatedOrgs',
 'NontaxableBenefitsRltdOrgsAmt',
 'OtherCompensationFilingOrg',
 'OtherCompensationFilingOrgAmt',
 'OtherCompensationRelatedOrgs',
 'OtherCompensationRltdOrgsAmt',
 'PersonNm',
 'Title',
 'TitleTxt',
 'TotalCompensationFilingOrg',
 'TotalCompensationFilingOrgAmt',
 'TotalCompensationRelatedOrgs',
 'Total

In [70]:
concordance[['xpath', 'part', 'MongoDB_Name', 'MongoDB_Name2', 'MongoDB_Name3', 'variable_name_new']][-44:-6]

Unnamed: 0,xpath,part,MongoDB_Name,MongoDB_Name2,MongoDB_Name3,variable_name_new
61,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartII/NamePerson,PART-02,,NamePerson,Form990ScheduleJPartII,SJ_02_PC_NAME_OFF_TRST_KEYEMP
62,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/PersonNm,PART-02,,PersonNm,RltdOrgOfficerTrstKeyEmplGrp,SJ_02_PC_NAME_OFF_TRST_KEYEMP
63,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartII/NameBusiness/BusinessNameLine1,PART-02,,BusinessNameLine1,NameBusiness,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1
64,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine1,PART-02,,BusinessNameLine1,BusinessName,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1
65,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine1Txt,PART-02,,BusinessNameLine1Txt,BusinessName,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L1
66,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartII/NameBusiness/BusinessNameLine2,PART-02,,BusinessNameLine2,NameBusiness,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2
67,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine2,PART-02,,BusinessNameLine2,BusinessName,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2
68,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/BusinessName/BusinessNameLine2Txt,PART-02,,BusinessNameLine2Txt,BusinessName,SJ_02_PC_NAME_OFF_TRST_KEYEMP_L2
69,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartII/Title,PART-02,,Title,Form990ScheduleJPartII,SJ_02_PC_TITLE
70,/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/TitleTxt,PART-02,,TitleTxt,RltdOrgOfficerTrstKeyEmplGrp,SJ_02_PC_TITLE


In [71]:
['RltdOrgOfficerTrstKeyEmplGrp', 'Form990ScheduleJPartII']

['RltdOrgOfficerTrstKeyEmplGrp', 'Form990ScheduleJPartII']

In [72]:
# Take the second-to-last xpath segment, i.e., the parent element that holds the key
concordance['MongoDB_Name3'] = concordance['xpath'].str.split('/')
concordance['MongoDB_Name3'] = concordance['MongoDB_Name3'].apply(lambda x: x[-2])
concordance[['xpath', 'part', 'MongoDB_Name', 'MongoDB_Name2', 'MongoDB_Name3', 'variable_name_new']][-6:]

Unnamed: 0,xpath,part,MongoDB_Name,MongoDB_Name2,MongoDB_Name3,variable_name_new
99,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartIII/Explanation,PART-03,,Explanation,Form990ScheduleJPartIII,SJ_03_PC_EXPLANATION_TEXT
100,/Return/ReturnData/IRS990ScheduleJ/SupplementalInformationDetail/ExplanationTxt,PART-03,,ExplanationTxt,SupplementalInformationDetail,SJ_03_PC_EXPLANATION_TEXT
101,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartIII/ReturnReference,PART-03,,ReturnReference,Form990ScheduleJPartIII,SJ_03_PC_FORM_AND_LINE_REFERENCE
102,/Return/ReturnData/IRS990ScheduleJ/SupplementalInformationDetail/FormAndLineReferenceDesc,PART-03,,FormAndLineReferenceDesc,SupplementalInformationDetail,SJ_03_PC_FORM_AND_LINE_REFERENCE
103,/Return/ReturnData/IRS990ScheduleJ/Form990ScheduleJPartIII/Identifier,PART-03,,Identifier,Form990ScheduleJPartIII,SJ_03_PC_IDENTIFIER
104,/Return/ReturnData/IRS990ScheduleJ/SupplementalInformationDetail/IdentifierTxt,PART-03,,IdentifierTxt,SupplementalInformationDetail,SJ_03_PC_IDENTIFIER
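
The two `apply(lambda ...)` steps used for *MongoDB_Name2* and *MongoDB_Name3* can also be written with the vectorized `.str` accessor. A minimal sketch on two xpaths taken from the concordance; the `toy` frame and the column names `last_segment` and `parent_segment` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'xpath': [
    '/Return/ReturnData/IRS990ScheduleJ/ClubDuesOrFees',
    '/Return/ReturnData/IRS990ScheduleJ/RltdOrgOfficerTrstKeyEmplGrp/PersonNm',
]})

# Split once, then index into the resulting lists with .str[...]
parts = toy['xpath'].str.split('/')
toy['last_segment'] = parts.str[-1]     # same role as MongoDB_Name2
toy['parent_segment'] = parts.str[-2]   # same role as MongoDB_Name3

print(toy[['last_segment', 'parent_segment']])
```

One practical difference: with a missing xpath, `.str[-1]` returns a missing value, whereas the lambda version raises an error on that row.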


In [73]:
part2_cols = set(concordance[concordance['part']=='PART-02']['MongoDB_Name3'].tolist())
print(len(part2_cols))
part2_cols

4


{'BusinessName',
 'Form990ScheduleJPartII',
 'NameBusiness',
 'RltdOrgOfficerTrstKeyEmplGrp'}

In [74]:
part3_cols = set(concordance[concordance['part']=='PART-03']['MongoDB_Name3'].tolist())
print(len(part3_cols))
part3_cols

2


{'Form990ScheduleJPartIII', 'SupplementalInformationDetail'}

In [75]:
set(part2_cols).intersection(set(dfj.columns.tolist()))

{'Form990ScheduleJPartII', 'RltdOrgOfficerTrstKeyEmplGrp'}

In [76]:
set(part3_cols).intersection(set(dfj.columns.tolist()))

{'Form990ScheduleJPartIII', 'SupplementalInformationDetail'}

#### Save Part II of Schedule J

In [77]:
dfj[['URL', 'Form990ScheduleJPartII', 'RltdOrgOfficerTrstKeyEmplGrp']].sample(5)

Unnamed: 0,URL,Form990ScheduleJPartII,RltdOrgOfficerTrstKeyEmplGrp
527936,https://s3.amazonaws.com/irs-form-990/202230469349301663_public.xml,,"[{'PersonNm': 'Alex Corrales', 'TitleTxt': 'WHA CEO', 'BaseCompensationFilingOrgAmt': '0', 'CompensationBasedOnRltdOrgsAmt': '192142', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '0', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '0', 'DeferredCompRltdOrgsAmt': '33756', 'NontaxableBenefitsFilingOrgAmt': '0', 'NontaxableBenefitsRltdOrgsAmt': '71730', 'TotalCompensationFilingOrgAmt': '0', 'TotalCompensatio..."
308927,https://s3.amazonaws.com/irs-form-990/201800609349300120_public.xml,,"{'PersonNm': 'FRANK SIRIANNI', 'TitleTxt': 'PRESIDENT', 'BaseCompensationFilingOrgAmt': '165454', 'OtherCompensationFilingOrgAmt': '780', 'TotalCompensationFilingOrgAmt': '166234'}"
306149,https://s3.amazonaws.com/irs-form-990/201801009349301210_public.xml,,"{'PersonNm': 'LES WATERS', 'TitleTxt': 'ARTISTIC DIRECTOR', 'BaseCompensationFilingOrgAmt': '216644', 'CompensationBasedOnRltdOrgsAmt': '0', 'BonusFilingOrganizationAmount': '0', 'BonusRelatedOrganizationsAmt': '0', 'OtherCompensationFilingOrgAmt': '7378', 'OtherCompensationRltdOrgsAmt': '0', 'DeferredCompensationFlngOrgAmt': '4371', 'DeferredCompRltdOrgsAmt': '0', 'NontaxableBenefitsFilingOrgAmt': '6346', 'NontaxableBenefitsRltdOrgsAmt': '0', 'TotalCompensationFilingOrgAmt': '234739', 'Tota..."
507472,https://s3.amazonaws.com/irs-form-990/202202179349301750_public.xml,,"{'PersonNm': 'Laurie Szenicer', 'TitleTxt': 'Chief Executive & Dev. Officer', 'BaseCompensationFilingOrgAmt': '172250', 'BonusFilingOrganizationAmount': '10000', 'NontaxableBenefitsFilingOrgAmt': '3135', 'TotalCompensationFilingOrgAmt': '185385'}"
697642,https://s3.amazonaws.com/irs-form-990/202540169349300429_public.xml,,"{'PersonNm': 'CHRISTOPHER MCGHEE', 'TitleTxt': 'TRAINING COORD', 'BaseCompensationFilingOrgAmt': '159889', 'DeferredCompensationFlngOrgAmt': '101251', 'TotalCompensationFilingOrgAmt': '261140'}"
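
The sample above shows the list-versus-dictionary issue: filings with several people hold a *list* of dictionaries, while filings with one person hold a single *dictionary*. Below is a minimal sketch of one way to put both shapes on the same footing and get one row per person. It assumes the cells hold Python objects rather than strings. The helper `as_list`, the `toy` frame, and all names and values in it are hypothetical.

```python
import pandas as pd

def as_list(value):
    """Wrap a single-person dict in a list; pass lists through; treat anything else as empty."""
    if isinstance(value, list):
        return value
    if isinstance(value, dict):
        return [value]
    return []

# Toy data (made-up values): a list, a lone dict, and a missing cell
toy = pd.DataFrame({
    'URL': ['filing_a', 'filing_b', 'filing_c'],
    'RltdOrgOfficerTrstKeyEmplGrp': [
        [{'PersonNm': 'PERSON ONE', 'TitleTxt': 'CEO'},
         {'PersonNm': 'PERSON TWO', 'TitleTxt': 'CFO'}],
        {'PersonNm': 'PERSON THREE', 'TitleTxt': 'PRESIDENT'},
        None,
    ],
})

# One record per person, carrying the filing URL along as the key
records = [
    {'URL': url, **person}
    for url, cell in zip(toy['URL'], toy['RltdOrgOfficerTrstKeyEmplGrp'])
    for person in as_list(cell)
]
people_df = pd.DataFrame(records)
print(people_df)
```

Filings with no Part II data simply contribute no rows, so the *URL* column can be used to merge the person-level rows back to the filing-level file.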


In [78]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfj[['URL', 'Form990ScheduleJPartII', 'RltdOrgOfficerTrstKeyEmplGrp']].to_pickle('Schedule J (Part II).pkl.gz', compression='gzip')

Current date and time :  2025-06-20 17:05:41 

CPU times: total: 1min 22s
Wall time: 1min 27s


#### Drop Part II columns

In [79]:
%%time
dfj = dfj.drop(['Form990ScheduleJPartII', 'RltdOrgOfficerTrstKeyEmplGrp'], axis=1)

CPU times: total: 2.44 s
Wall time: 2.51 s


#### Save Part III

In [80]:
dfj[['URL', 'Form990ScheduleJPartIII', 'SupplementalInformationDetail']].sample(5)

Unnamed: 0,URL,Form990ScheduleJPartIII,SupplementalInformationDetail
705046,https://s3.amazonaws.com/irs-form-990/202422959349301912_public.xml,,
771345,https://s3.amazonaws.com/irs-form-990/202441359349301999_public.xml,,
680635,https://s3.amazonaws.com/irs-form-990/202343139349303809_public.xml,,"{'FormAndLineReferenceDesc': 'SCHEDULE J PART I', 'ExplanationTxt': 'LINE 3: COMPENSATION OF THE ORGANIZATION'S TOP MANAGEMENT OFFICIALS WAS PAID BY A RELATED ORGANIZATION. MENNONITE CHURCH BUILDINGS, INC. RELIED ON THE RELATED ORGANIZATION, WHICH USED SEVERAL OF THE METHODS DESCRIBED ON LINE 3, TO ESTABLISH TOP MANAGEMENT OFFICIAL COMPENSATION.'}"
75048,https://s3.amazonaws.com/irs-form-990/201300749349300535_public.xml,,
85020,https://s3.amazonaws.com/irs-form-990/201322209349300412_public.xml,"[{'Identifier': 'SchJ_P01_S00_L04', 'ReturnReference': 'Schedule J, Part I, Line 4', 'Explanation': 'Michael Cleary, former senior VP & COO, received a severance payment of $162,656. Cheryl Yager, Director of Credit, received a severance payment of $97,414.'}, {'Identifier': 'SchJ_P01_S00_L07', 'ReturnReference': 'Schedule J, Part I, Line 7', 'Explanation': 'ERCOT maintains an employee recognition award program whereby employees can receive one or more awards during the course of the year up...",


In [81]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfj[['URL', 'Form990ScheduleJPartIII', 'SupplementalInformationDetail']].to_pickle('Schedule J (Part III).pkl.gz', compression='gzip')

Current date and time :  2025-06-20 22:23:56 

CPU times: total: 22.2 s
Wall time: 25.8 s


In [82]:
dfj[dfj['URL']=='https://s3.amazonaws.com/irs-form-990/201401349349308135_public.xml']

Unnamed: 0,OrganizationName,URL,DLN,EIN,@documentId,@softwareId,@softwareVersion,CompensationCommittee,CompensationSurvey,BoardOrCommitteeApproval,SeverancePayment,SupplementalNonqualRetirePlan,EquityBasedCompArrangement,CompBasedOnRevenueOfFilingOrg,CompBasedOnRevenueRelatedOrgs,CompBasedNetEarningsFilingOrg,CompBasedNetEarningsRelateOrgs,AnyNonFixedPayments,InitialContractException,RebuttablePresumptionProcedure,IndependentConsultant,WrittenEmploymentContract,Form990ScheduleJPartIII,HousingAllowanceOrResidence,WrittenPolicyReTAndEExpenses,SubstantiationRequired,IdemnificationGrossUpPayments,DiscretionarySpendingAccount,ClubDuesOrFees,FirstClassOrCharterTravel,TravelForCompanions,Form990OfOtherOrganizations,PaymentsForUseOfResidence,PersonalServices,SeverancePaymentInd,SupplementalNonqualRtrPlanInd,EquityBasedCompArrngmInd,CompBasedOnRevenueOfFlngOrgInd,CompBsdOnRevRelatedOrgsInd,CompBsdNetEarnsFlngOrgInd,CompBsdNetEarnsRltdOrgsInd,AnyNonFixedPaymentsInd,InitialContractExceptionInd,CompensationCommitteeInd,BoardOrCommitteeApprovalInd,RebuttablePresumptionProcInd,IndependentConsultantInd,WrittenEmploymentContractInd,CompensationSurveyInd,SupplementalInformationDetail,Form990OfOtherOrganizationsInd,DiscretionarySpendingAcctInd,WrittenPolicyRefTAndEExpnssInd,SubstantiationRequiredInd,TravelForCompanionsInd,IdemnificationGrossUpPmtsInd,ClubDuesOrFeesInd,HousingAllowanceOrResidenceInd,FirstClassOrCharterTravelInd,PersonalServicesInd,PaymentsForUseOfResidenceInd,@documentName,@softwareVersionNum
137883,HARVEY L MILLER SUPPORTING FOUNDATION,https://s3.amazonaws.com/irs-form-990/201401349349308135_public.xml,93493134081354,900187252,RetDoc1042400001,,,,,,0,1,0,0,0,0,0,0,0,,,,"{'ReturnReference': 'Part I, Line 4b', 'Explanation': 'Steven B. Nasatir: In 1999 the Jewish Federation entered into an agreement with Steven Nasatir that was contingent upon 5 more years of service as president and CEO and would result in annual payments of $50,000 per year (net of tax) beginning at age 64 and lasting throughout his lifetime. Dr. Nasatir began receiving payments under this agreement in 2009. During 2012, the Jewish Federation entered into a second agreement with Steven Nasa...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Drop Part III columns

In [83]:
%%time
dfj = dfj.drop(['Form990ScheduleJPartIII', 'SupplementalInformationDetail'], axis=1)

CPU times: total: 2.31 s
Wall time: 2.43 s


#### Save Part I

In [100]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
dfj.to_pickle('Schedule J (Part I).pkl.gz', compression='gzip')

Current date and time :  2024-04-17 13:40:06 

CPU times: total: 24.7 s
Wall time: 25.7 s
