# Overview

This is the fifth in a series of tutorials that illustrate how to download, extract, and parse the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/

In the previous notebook we used the concordance file to combine pairs of columns that reflect the same 990 variable, such as *TaxPeriodBeginDt* and *TaxPeriodBeginDate*, and assigned each combined column the relevant 'standardized' name from the concordance file. We then 'binarized' the relevant columns and, lastly, deleted unneeded columns.


The goal of this notebook is to parse all of the 'dictionary' columns, that is, those with 'nested' dictionary structures. For example, the data for one observation in the *Filer* column extracted from the XML file might look like this:

``{'EIN': '203840246', 'Name': {'BusinessNameLine1': 'NEW ALBANY WALKING CLUB INC'}, 'NameControl': 'NEWA', 'USAddress': {'AddressLine1': '4000 BAUGHMAN GRANT', 'City': 'NEW ALBANY', 'State': 'OH', 'ZIPCode': '43054'}}``

And the data for *F9_10_PC_UNSECURED_NOTES_BOY* may look like this:

``{'BOYAmt': '24000', 'EOYAmt': '47479'}``

In effect, multiple variables are nested under the same column extracted from the raw e-file data. In the concordance file I have added a column, called 'sub_key', that tells us that, for the variable *F9_10_PC_UNSECURED_NOTES_BOY*, we will want to extract the data nested under the ``BOYAmt`` and ``BOY`` keys. I added this information after conducting extensive verifications on the data.
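To make the role of 'sub_key' concrete, here is a minimal sketch in plain Python using the example dictionary shown above. The helper name `first_present` is hypothetical and is not part of this notebook's code; it simply returns the value stored under whichever candidate key is present.

```python
# One nested observation, copied from the example above
cell = {'BOYAmt': '24000', 'EOYAmt': '47479'}

def first_present(d, candidate_keys):
    """Return the value under the first candidate key found in d, else None.

    Hypothetical helper, for illustration only.
    """
    for key in candidate_keys:
        if key in d:
            return d[key]
    return None

# The concordance lists two valid sub-keys per variable; a given
# dictionary is expected to contain only one of them.
boy = first_present(cell, ['BOYAmt', 'BOY'])
eoy = first_present(cell, ['EOYAmt', 'EOY'])
print(boy, eoy)
```

Note that the extracted values are still strings at this point; converting them to numbers is a separate step.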

Accordingly, our first step in this notebook will be to read in the concordance file that has all the reconciled and verified variables to date:
- The file is called *concordance_VERIFIED.xlsx*

We then read in the PANDAS data file (N=2,104,435) saved in our last notebook: 
- *all filings August 2022 - all control variables (renamed).pkl.gz*

I then parse all columns that have Python dictionaries as values and then save an updated e-file dataframe:
- *all filings August 2022 - all control variables (with parsed sub-key variables).pkl.gz*

# Load Packages

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

1.4.3


In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [4]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read in Concordance File
Read in the 'concordance' file. This codebook will help us identify the variables that contain dictionaries. We will then use the 'sub_key' column to help parse these columns.

In [5]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

Current date and time :  2022-08-10 12:35:34 

# of columns: 17
# of observations: 574
Wall time: 2.56 s


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key,cardinality
0,/Return/ReturnData/IRS990/SpecialConditionDesc,F9_00_HD_SPECIAL_CONDITION_DESC,,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDesc,,,
1,/Return/ReturnData/IRS990/SpecialConditionDescription,F9_00_HD_SPECIAL_CONDITION_DESC,31.0,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDescription,,,


In [17]:
concordance['cardinality'].value_counts()

ONE     88
MANY     2
Name: cardinality, dtype: int64

In [18]:
concordance[concordance['cardinality']=='MANY']

Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key,cardinality
422,/Return/ReturnData/IRS990/OtherExpenses/Total,F9_09_EXP_OTH_TOT,,,,,All Other expenses - total expense,F990-PC-PART-09-LINE-24A,PART-09,USAmountType,Int64,,,OtherExpenses,Total,,MANY
423,/Return/ReturnData/IRS990/OtherExpensesGrp/TotalAmt,F9_09_EXP_OTH_TOT,,,,,Other Expenses - total expense,F990-PC-PART-09-LINE-24A,PART-09,USAmountType,Int64,,,OtherExpensesGrp,TotalAmt,,MANY


In [19]:
concordance[concordance['sub_key'].notnull()][['variable_name_new', 'MongoDB_Name', 'sub_key']]

Unnamed: 0,variable_name_new,MongoDB_Name,sub_key
122,F9_03_PC_PROG_SVC_ACC_2_CODE,Activity2,ActivityCode
123,F9_03_PC_PROG_SVC_ACC_2_CODE,ProgSrvcAccomActy2Grp,ActivityCode
124,F9_03_PC_PROG_SVC_ACC_3_CODE,Activity3,ActivityCode
125,F9_03_PC_PROG_SVC_ACC_3_CODE,ProgSrvcAccomActy3Grp,ActivityCode
146,F9_03_PC_PROG_SVC_ACC_2_DESC,Activity2,Description
...,...,...,...
556,F9_00_HD_FILER_CITY_US,Filer,USAddress
557,F9_00_HD_FILER_COUNTRY_FRGN,Filer,ForeignAddress
558,F9_00_HD_FILER_COUNTRY_FRGN,Filer,ForeignAddress
559,F9_00_HD_FILER_ZIP_US,Filer,USAddress


In [20]:
subkeycols = list(set(concordance[concordance['sub_key'].notnull()]['variable_name_new'].tolist()))
print(len(subkeycols))
subkeycols

109


['F9_09_PC_COMP_OFFICERS_PROG_SVCE',
 'F9_08_PC_TOTAL_REVENUE',
 'F9_09_EXP_OCCUPANCY_TOT',
 'F9_03_PC_PROG_SVC_ACC_2_DESC',
 'F9_09_PC_COMP_DISQUAL_MGMT',
 'F9_00_HD_FILER_COUNTRY_FRGN',
 'F9_10_ASSETS_NOTES_LOANS_NET_EOY',
 'F9_09_PC_PAYROLL_TAX_FUNDRAISE',
 'F9_09_PC_OTHER_EMP_BEN_PROG_SVCE',
 'F9_09_PC_PAYMENT_TO_AFFILIATES',
 'F9_09_PC_COMP_OFFICERS_MGMT',
 'F9_09_PC_OTHER_SALARY_PROG_SVCE',
 'F9_00_HD_FILER_ZIP_US',
 'F9_09_EXP_DEPREC_PROG',
 'F9_10_NAFB_RESTRICT_TEMP_EOY',
 'F9_10_PC_INVEST_PROG_RELTD_EOY',
 'F9_10_PC_BOND_LIABILITY_EOY',
 'F9_10_ASSETS_PLEDGES_NET_EOY',
 'F9_03_PC_PROG_SVC_ACC_2_EXP',
 'F9_10_ASSETS_LESS_DEPREC_EOY',
 'F9_09_PC_PAYROLL_TAX_TOTAL',
 'F9_10_LIAB_REV_DEFERRED_EOY',
 'F9_09_PC_COMP_OFFICERS_FUNDRAISE',
 'F9_10_LIAB_LOANS_OFF_EOY',
 'F9_09_EXP_CONF_MEETING_TOT',
 'F9_09_EXP_GRANT_ORG_DMSTC_TOT',
 'F9_09_PC_OTHER_EMP_BEN_TOTAL',
 'F9_09_PC_FEES_FOR_SVCE_LEGL_TOT',
 'F9_03_PC_PROG_SVC_ACC_3_EXP',
 'F9_10_PC_INVEST_OTHER_SEC_EOY',
 'F9_09_PC_PENSION_CO

# Read 990 Data 
In the following code block we read the file (produced in the previous notebook) containing all filings into a PANDAS dataframe.

In [10]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_pickle('all filings August 2022 - all control variables (renamed).pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

Current date and time :  2022-08-10 12:35:57 

# of columns: 289
# of observations: 2104435
Wall time: 4min 19s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,Filer,F9_00_HD_BUILD_TIME_STAMP,fiscal_year,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_BEGIN,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_03_PC_PGMSVC_SIGNIF_CHG,F9_03_PC_PGMSVC_SIGNIF_NEW,F9_03_PC_PROG_SVC_ACC_1_CODE,F9_03_PC_PROG_SVC_ACC_1_DESC,F9_03_PC_PROG_SVC_ACC_1_EXP,F9_03_PC_PROG_SVC_ACC_1_
GRNT,F9_03_PC_PROG_SVC_ACC_1_REV,F9_03_PC_PROG_SVC_ACC_2_CODE,F9_03_PC_PROG_SVC_ACC_2_DESC,F9_03_PC_PROG_SVC_ACC_2_EXP,F9_03_PC_PROG_SVC_ACC_2_GRNT,F9_03_PC_PROG_SVC_ACC_2_REV,F9_03_PC_PROG_SVC_ACC_3_CODE,F9_03_PC_PROG_SVC_ACC_3_DESC,F9_03_PC_PROG_SVC_ACC_3_EXP,F9_03_PC_PROG_SVC_ACC_3_GRNT,F9_03_PC_PROG_SVC_ACC_3_REV,F9_03_PC_TOT_OTH_PROG_SVC_EXP,F9_03_PC_TOT_OTH_PROG_SVC_GRNT,F9_03_PC_TOT_OTH_PROG_SVC_REV,F9_03_PC_TOT_PROG_SVC_EXPENSE,F9_03_PZ_MISSION_DESCRIPTION,F9_03_PZ_SCHEDULE_O_PART3,F9_04_PC_ACTVITIES_VIA_PARTNER,F9_04_PC_CONTROLLED_ENTITY,F9_04_PC_DISREGARDED_ENTITY,F9_04_PC_EXCESS_BENEFIT_TRANS,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_LOBBYING_ACTIVITIES,F9_04_PC_POLITICAL_ACTIVITIES,F9_04_PC_PRIOR_EXCESS_BEN_TRAN,F9_04_PC_PROF_FR_EXP_GT_15K,F9_04_PC_RELATED_ENTITY,F9_04_PC_TRANS_TO_CNTRLD_ENT,F9_04_PC_TRANS_WITH_CNTRLD_ENT,F9_05_EXP_SCHED_O_X,F9_05_PC_NUMBER_EMPLOYEES_W3,F9_05_PC_NUMBER_FORMS_1096,F9_05_PC_UNRELATED_BUS_INCOME,F9_06_EXP_SCHED_O_X,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_EXP_SCHED_O_X,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_N
O_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_EXP_SCHED_O_X,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_EXP_AD_PROMO_TOT,F9_09_EXP_BENF_PAID_MEMB_TOT,F9_09_EXP_CONF_MEETING_TOT,F9_09_EXP_DEPREC_FUNDR,F9_09_EXP_DEPREC_MAG,F9_09_EXP_DEPREC_PROG,F9_09_EXP_DEPREC_TOT,F9_09_EXP_GRANT_FRGN_TOT,F9_09_EXP_GRANT_INDIV_DMSTC_TOT,F9_09_EXP_GRANT_ORG_DMSTC_TOT,F9_09_EXP_INFO_TECH_TOT,F9_09_EXP_INSURANCE_TOT,F9_09_EXP_INTEREST_TOT,F9_09_EXP_JOINT_COSTS_TOT,F9_09_EXP_OCCUPANCY_TOT,F9_09_EXP_OFFICE_TOT,F9_09_EXP_OTH_OTH_TOT,F9_09_EXP_OTH_TOT,F9_09_EXP_ROY_TOT,F9_09_EXP_SCHED_O_X,F9_09_EXP_TRAVEL_ENTRTNMNT_TOT,F9_09_EXP_TRAVEL_TOT,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL
,F9_09_PC_PAYMENT_TO_AFFILIATES,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_ASSETS_ACC_NET_EOY,F9_10_ASSETS_EXP_PREPAID_EOY,F9_10_ASSETS_INTANGIB_EOY,F9_10_ASSETS_INVENT_SALE_EOY,F9_10_ASSETS_LESS_DEPREC_EOY,F9_10_ASSETS_LOANS_DISQUAL_EOY,F9_10_ASSETS_NOTES_LOANS_NET_EOY,F9_10_ASSETS_OTH_EOY,F9_10_ASSETS_PLEDGES_NET_EOY,F9_10_LIAB_ACC_PAYABLE_EOY,F9_10_LIAB_GRANTS_PAYABLE_EOY,F9_10_LIAB_LOANS_OFF_EOY,F9_10_LIAB_REV_DEFERRED_EOY,F9_10_NAFB_RESTRICT_PERM_EOY,F9_10_NAFB_RESTRICT_TEMP_EOY,F9_10_NAFB_UNRESTRICT_EOY,F9_10_PC_BOND_LIABILITY_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_ESCROW_LIABILITY_EOY,F9_10_PC_INVEST_OTHER_SEC_EOY,F9_10_PC_INVEST_PROG_RELTD_EOY,F9_10_PC_INVEST_PUB_TRADED_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_SECURE_MORT_NOTES_EOY,F9_10_PC_UNSECURED_LOANS_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_10_SCHED_O_X,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_11_SCHED_O_X,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,F9_12_SCHED_O_X,number_of_other_prog_svces,501c3
0,NEW ALBANY WALKING CLUB INC,https://s3.amazonaws.com/irs-form-990/201121029349300402_public.xml,93493102004021,201012,203840246,,"{'EIN': '203840246', 'Name': {'BusinessNameLine1': 'NEW ALBANY WALKING CLUB INC'}, 'NameControl': 'NEWA', 'USAddress': {'AddressLine1': '4000 BAUGHMAN GRANT', 'City': 'NEW ALBANY', 'State': 'OH', 'ZIPCode': '43054'}}",2016-02-24 21:20:13Z,2010,,,,,,1,,,299757,0,,,PHILIP HEIT,"{'Name': 'PHILIP HEIT', 'Title': 'PRESIDENT', 'DateSigned': '2011-03-22', 'AuthorizeThirdParty': 'true'}",,,2010-01-01,2010-12-31,2010,2011-04-12T08:58:08-05:00,,1,,,,NEWALBANYWALKINGCLUB.COM,,,172910,105000,110600,7,131,105105,104478,2094,0,,134744,62558,26891,,167663,215078,6717,0,,0,241969,0,,7,0,114700,125,167663,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,122499,0,126722,0,,105105,237199,,299757,0,0,,"TO PROMOTE HEALTH AND PREVENT DISEASE, ENCOURAGE A SUPPORTIVE ENVIRONMENT, AND ELEVATE THE STATUS OF WALKING AND ITS BENEFITS TO INDIVIDUALS AND THE CENTRAL OHIO COMMUNITY.",228055,114700,126847,,,,,,,,,,,,,,228055,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,1,0,,0,0,0,0,0,0,0,0,1,,1,0,,0,0,0,1,1,,7,7,0,0,,,,OH,0,,0,0,1,,,0,,,,,172000,,,,,,,,,,,910,,126722,,172910,,126722,"{'TotalRevenueColumn': '299757', 'RelatedOrExemptFunctionIncome': '126847'}",,,,"{'Total': '1213', 'ProgramServices': '1213'}","{'Total': '1213', 'ProgramServices': '1213'}","{'Total': '1213', 'ProgramServices': '1213'}","{'Total': '1213', 'ProgramServices': '1213'}",,,"{'Total': '114700', 'ProgramServices': '114700'}",,,,,,"{'Total': '1352', 'ManagementAndGeneral': '1352'}","{'Total': '19955', 'ProgramServices': '19955'}","[{'Description': 'RACE CLOTHING', 'Total': '37423', 'ProgramServices': '37423'}, {'Description': 'RACE MANAGEMENT FEE', 'Total': '30069', 'ProgramServices': '30069'}, {'Description': 'RACE EXPENSES', 'Total': '17372', 'ProgramServices': '17372'},...",,,,,,,,,,,,,"{'Total': '1000', 
'ManagementAndGeneral': '1000'}",,,,,"{'Total': '75', 'ManagementAndGeneral': '75'}",,,,,,,,,,,,,,,,,,"{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}",,,,,"{'BOY': '537', 'EOY': '322'}",,,,,,,,,,,,,"{'BOY': '104568', 'EOY': '167341'}","{'BOY': '104568', 'EOY': '167341'}",,,,,2883,2561,,,1,,"{'BOY': '105105', 'EOY': '167663'}",,,,,,,,"{'BOY': '105105', 'EOY': '167663'}",,,,,62558,,,0,,1,,,,,0,,,1


<br>Print out list of all 289 columns

In [12]:
print(df.columns.tolist())

['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'Filer', 'F9_00_HD_BUILD_TIME_STAMP', 'fiscal_year', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_BEGIN', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_CURR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PR

# CODE TO FLATTEN DICTIONARY

### Combine Variables in *Concordance* File

In [21]:
df[['F9_00_HD_BUILD_TIME_STAMP' ,'F9_00_HD_TIME_STAMP', 'F9_00_HD_TAX_YEAR', 'TaxPeriod']].sample(5)

Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_TIME_STAMP,F9_00_HD_TAX_YEAR,TaxPeriod
351276,2016-03-07 17:11:31Z,2013-11-05T06:52:25-08:00,2012,201212
375869,2015-11-30 17:44:51Z,2014-04-15T18:36:25-05:00,2012,201305
305666,2016-02-24 21:20:13Z,2013-05-15T11:36:58-05:00,2011,201206
1349048,2018-06-14 16:35:46Z,2018-08-28T15:51:14-05:00,2017,201806
2091145,2021-01-29 14:40:06Z,2020-11-13T12:26:58-08:00,2019,201912


### Collapse concordance file
We'll aggregate the concordance file in order to get the list of valid 'sub-keys' for each nested/dictionary variable.
- Note: I added 'cardinality' to *new_variables_df* in order to deal with *F9_09_EXP_OTH_TOT*

In [14]:
def agg_funcs(x):
    # x holds all concordance rows that share one standardized variable name
    names = {
        #'name': x['variable_name_new'].head(1).values[0],
        'original_names':  list(set(x['MongoDB_Name'].tolist())),  # unique original (XML) names
        'sub_keys':  list(set(x['sub_key'].tolist())),             # unique valid sub-keys
        'data_type_xsd': x['data_type_xsd'].head(1).values[0],     # taken from the first row
        'cardinality': x['cardinality'].head(1).values[0]          # taken from the first row
    }
    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS
    #return pd.Series(names, index = list(names.keys()))
    return pd.Series(names, index=['original_names', 'sub_keys', 'data_type_xsd', 'cardinality'])
# Keep only rows that have a sub_key, then collapse to one row per standardized variable
new_variables_df = concordance[concordance['sub_key'].notnull()].groupby(['variable_name_new']).apply(agg_funcs)
new_variables_df = new_variables_df.reset_index()
print('# of variables:', len(new_variables_df))
new_variables_df

# of variables: 109


Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality
0,F9_00_HD_FILER_ADDR_US_L1,[Filer],[USAddress],StreetAddressType,
1,F9_00_HD_FILER_ADDR_US_L2,[Filer],[USAddress],StreetAddressType,
2,F9_00_HD_FILER_CITY_US,[Filer],[USAddress],CityType,
3,F9_00_HD_FILER_COUNTRY_FRGN,[Filer],[ForeignAddress],CountryType,
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,
...,...,...,...,...,...
104,F9_10_PC_SECURE_MORT_NOTES_EOY,"[MortNotesPyblSecuredInvestProp, MortgNotesPyblScrdInvstPropGrp]","[EOY, EOYAmt]",USAmountType,
105,F9_10_PC_UNSECURED_LOANS_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,
106,F9_10_PC_UNSECURED_NOTES_BOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[BOYAmt, BOY]",USAmountType,
107,F9_10_PC_UNSECURED_NOTES_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,


<br>One variable in this list has a value of 'MANY' for *cardinality*

In [22]:
new_variables_df['cardinality'].value_counts()

ONE     36
MANY     1
Name: cardinality, dtype: int64

In [23]:
new_variables_df[new_variables_df['cardinality']=='MANY']

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality
35,F9_09_EXP_OTH_TOT,"[OtherExpensesGrp, OtherExpenses]","[Total, TotalAmt]",USAmountType,MANY
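The collapsing step used earlier in this section can be hard to follow on the full concordance file, so here is a minimal sketch on a three-row toy table. The toy values are copied from concordance rows shown in this notebook, but the table itself and the names `toy` and `collapsed` are illustrative assumptions. Unlike the code above, this sketch sorts each list so that the order of names is reproducible from run to run (``list(set(...))`` does not guarantee an order).

```python
import pandas as pd

# Toy stand-in for the concordance file
toy = pd.DataFrame({
    'variable_name_new': ['F9_09_EXP_OTH_TOT', 'F9_09_EXP_OTH_TOT', 'F9_00_HD_FILER_CITY_US'],
    'MongoDB_Name': ['OtherExpenses', 'OtherExpensesGrp', 'Filer'],
    'sub_key': ['Total', 'TotalAmt', 'USAddress'],
})

grouped = toy.groupby('variable_name_new')

# One row per standardized variable, each cell holding a sorted list of unique values
collapsed = pd.DataFrame({
    'original_names': grouped['MongoDB_Name'].apply(lambda s: sorted(set(s))),
    'sub_keys': grouped['sub_key'].apply(lambda s: sorted(set(s))),
})
print(collapsed)
```

The result is indexed by the standardized variable name, so ``collapsed.loc['F9_09_EXP_OTH_TOT', 'sub_keys']`` returns the list of valid sub-keys for that variable.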


<br>Create new variable in the collapsed concordance file to indicate the number of original names for each variable in the XML (e-file) data. Seven of the variables have one original name while 102 have two.

In [24]:
new_variables_df['len'] = new_variables_df['original_names'].apply(lambda x: len(x))
print(new_variables_df['len'].value_counts(), '\n')
new_variables_df

2    102
1      7
Name: len, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len
0,F9_00_HD_FILER_ADDR_US_L1,[Filer],[USAddress],StreetAddressType,,1
1,F9_00_HD_FILER_ADDR_US_L2,[Filer],[USAddress],StreetAddressType,,1
2,F9_00_HD_FILER_CITY_US,[Filer],[USAddress],CityType,,1
3,F9_00_HD_FILER_COUNTRY_FRGN,[Filer],[ForeignAddress],CountryType,,1
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,,1
...,...,...,...,...,...,...
104,F9_10_PC_SECURE_MORT_NOTES_EOY,"[MortNotesPyblSecuredInvestProp, MortgNotesPyblScrdInvstPropGrp]","[EOY, EOYAmt]",USAmountType,,2
105,F9_10_PC_UNSECURED_LOANS_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,,2
106,F9_10_PC_UNSECURED_NOTES_BOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[BOYAmt, BOY]",USAmountType,,2
107,F9_10_PC_UNSECURED_NOTES_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,,2


<br>We'll also do the same for the number of sub-keys for each of the dictionary variables.

In [25]:
new_variables_df['len_subkeys'] = new_variables_df['sub_keys'].apply(lambda x: len(x))
print(new_variables_df['len_subkeys'].value_counts(), '\n')
new_variables_df

2    100
1      9
Name: len_subkeys, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
0,F9_00_HD_FILER_ADDR_US_L1,[Filer],[USAddress],StreetAddressType,,1,1
1,F9_00_HD_FILER_ADDR_US_L2,[Filer],[USAddress],StreetAddressType,,1,1
2,F9_00_HD_FILER_CITY_US,[Filer],[USAddress],CityType,,1,1
3,F9_00_HD_FILER_COUNTRY_FRGN,[Filer],[ForeignAddress],CountryType,,1,1
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,,1,1
...,...,...,...,...,...,...,...
104,F9_10_PC_SECURE_MORT_NOTES_EOY,"[MortNotesPyblSecuredInvestProp, MortgNotesPyblScrdInvstPropGrp]","[EOY, EOYAmt]",USAmountType,,2,2
105,F9_10_PC_UNSECURED_LOANS_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,,2,2
106,F9_10_PC_UNSECURED_NOTES_BOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[BOYAmt, BOY]",USAmountType,,2,2
107,F9_10_PC_UNSECURED_NOTES_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,,2,2


### Write extended 'lambda' functions to parse sub-key variables
Here we will write two functions to deal with dictionary variables that have one or two valid sub-key names, respectively. I have leaned here on functions on Stack Overflow: https://stackoverflow.com/questions/48872234/using-apply-in-pandas-lambda-functions-with-multiple-if-statements?noredirect=1&lq=1

The trick is that these functions return the value stored under the nested sub-key if it exists, and a missing value (``np.nan``) if the cell is empty or the nested key(s) do not exist. We will apply these functions in loops later on in this notebook.

In [28]:
def func_onekey(x, key1):
    # Return the value stored under key1, or NaN if the cell is empty or lacks key1
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    else:
        return np.nan

In [29]:
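def func(x, key1, key2):
    # Return the value stored under key1 if present, otherwise under key2;
    # return NaN if the cell is empty or has neither key
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan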
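Before applying the two-key function to two million filings, it can help to see it run on a few rows. The sketch below restates ``func`` so that the block runs on its own and applies it to an invented toy column; the name `toy` and its contents are illustrative assumptions, although the dictionaries follow the shapes shown elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def func(x, key1, key2):
    # Restated from the cell above so that this block is self-contained
    if pd.isnull(x):
        return np.nan
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan

# Toy column: two naming variants, one dictionary lacking both keys, one empty cell
toy = pd.Series([
    {'BOY': '537', 'EOY': '322'},
    {'BOYAmt': '24000', 'EOYAmt': '47479'},
    {'Total': '1213'},
    np.nan,
])

# Series.apply forwards extra keyword arguments to the function
boy = toy.apply(func, key1='BOY', key2='BOYAmt')

# The extracted values are strings; pd.to_numeric converts them and keeps NaN
boy_numeric = pd.to_numeric(boy)
print(boy_numeric)
```

Rows with neither key and rows with an empty cell both end up as ``NaN``, so after parsing the two cases can no longer be told apart.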

<br>Show the nine variables with a single nested sub-key

In [30]:
new_variables_df[new_variables_df['len_subkeys']!=2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
0,F9_00_HD_FILER_ADDR_US_L1,[Filer],[USAddress],StreetAddressType,,1,1
1,F9_00_HD_FILER_ADDR_US_L2,[Filer],[USAddress],StreetAddressType,,1,1
2,F9_00_HD_FILER_CITY_US,[Filer],[USAddress],CityType,,1,1
3,F9_00_HD_FILER_COUNTRY_FRGN,[Filer],[ForeignAddress],CountryType,,1,1
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,,1,1
5,F9_00_HD_FILER_ZIP_US,[Filer],[USAddress],ZIPCodeType,,1,1
7,F9_03_PC_PROG_SVC_ACC_2_CODE,"[Activity2, ProgSrvcAccomActy2Grp]",[ActivityCode],IntegerNNType,,2,1
12,F9_03_PC_PROG_SVC_ACC_3_CODE,"[ProgSrvcAccomActy3Grp, Activity3]",[ActivityCode],IntegerNNType,,2,1
98,F9_10_PC_LOANS_FROM_OFFICERS_EOY,"[LoansFromOfficersDirectorsGrp, LoansFromOfficersDirectors]",[EOYAmt],USAmountType,,2,1


### Process 
I first deal with a handful of variables a la carte. In future iterations of this notebook I will incorporate them into the loop we'll be processing later on.

Note: the reason these are being dealt with separately here is that the *Filer* variables contain double-nested data and I have yet to add the 'sub-sub-keys' to the concordance file. So, in the next code block I first transform the data for these four variables to be the value of the nested sub-key *USAddress*. Then, in the subsequent code blocks, we further transform these four variables to take the value of the sub-sub-key.

In [32]:
%%time
df['F9_00_HD_FILER_ADDR_US_L1'] = df['Filer'][:].apply(func_onekey, key1='USAddress')
df['F9_00_HD_FILER_ADDR_US_L2'] = df['Filer'][:].apply(func_onekey, key1='USAddress')
df['F9_00_HD_FILER_CITY_US'] = df['Filer'][:].apply(func_onekey, key1='USAddress')
df['F9_00_HD_FILER_ZIP_US'] = df['Filer'][:].apply(func_onekey, key1='USAddress')

Wall time: 37.8 s


<br>The following variable is not of great interest to us so I'm just going to take the value of the entire *ForeignAddress* key and not parse it any further.

In [33]:
%%time
df['F9_00_HD_FILER_COUNTRY_FRGN'] = df['Filer'][:].apply(func_onekey, key1='ForeignAddress')

Wall time: 9.05 s


<br>Now let's parse the four variables noted above, in order. For each one we apply our custom function ``func``, which replaces the variable with the value of whichever of the two sub-keys is present. Taking *F9_00_HD_FILER_ADDR_US_L1* as an example: above we already reduced this variable from everything contained under *Filer* to only the *USAddress* key. Below we reduce it further to the value of either 'AddressLine1' or 'AddressLine1Txt'. Each filing will have only one of these two sub-keys, depending on the year of the filing.

In [34]:
%%time
df['F9_00_HD_FILER_ADDR_US_L1'] = df['F9_00_HD_FILER_ADDR_US_L1'][:].apply(func, key1='AddressLine1', key2='AddressLine1Txt')

Wall time: 9.64 s


In [35]:
%%time
df['F9_00_HD_FILER_ADDR_US_L2'] = df['F9_00_HD_FILER_ADDR_US_L2'][:].apply(func, key1='AddressLine2', key2='AddressLine2Txt')

Wall time: 9.25 s


In [36]:
%%time
df['F9_00_HD_FILER_CITY_US'] = df['F9_00_HD_FILER_CITY_US'][:].apply(func, key1='City', key2='CityNm')

Wall time: 9.08 s


In [37]:
%%time
df['F9_00_HD_FILER_ZIP_US'] = df['F9_00_HD_FILER_ZIP_US'][:].apply(func, key1='ZIPCd', key2='ZIPCode')

Wall time: 9.07 s
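The two-step approach used above (first *USAddress*, then the sub-sub-key) could also be written as a single pass over *Filer*. The sketch below is one possible alternative, not the method used in this notebook. The helper name `get_nested` is hypothetical, and the toy rows are invented apart from the first one, which is abbreviated from the example in the overview.

```python
import numpy as np
import pandas as pd

def get_nested(x, outer_key, inner_keys):
    """Return x[outer_key][k] for the first k in inner_keys that exists, else NaN.

    Hypothetical helper, for illustration only.
    """
    if not isinstance(x, dict):
        return np.nan
    inner = x.get(outer_key)
    if not isinstance(inner, dict):
        return np.nan
    for k in inner_keys:
        if k in inner:
            return inner[k]
    return np.nan

filer = pd.Series([
    {'EIN': '203840246', 'USAddress': {'City': 'NEW ALBANY', 'State': 'OH', 'ZIPCode': '43054'}},
    {'USAddress': {'CityNm': 'DENVER', 'ZIPCd': '80222'}},   # invented row, newer key names
    {'ForeignAddress': {'CountryCd': 'CA'}},                 # invented row, no US address
    np.nan,                                                  # empty cell
])

# One pass per output variable instead of two
city = filer.apply(get_nested, outer_key='USAddress', inner_keys=('City', 'CityNm'))
zip_code = filer.apply(get_nested, outer_key='USAddress', inner_keys=('ZIPCode', 'ZIPCd'))
print(pd.DataFrame({'city': city, 'zip': zip_code}))
```

A single pass avoids storing the intermediate *USAddress* dictionary in each output column, at the cost of a slightly longer helper.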


<br>Now let's take a look at these four variables in a random sample of five filings. All appear to be parsed successfully.

In [38]:
%%time
df[['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US', 'F9_00_HD_FILER_ZIP_US']].sample(5)

Wall time: 1min 10s


Unnamed: 0,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US
1768051,2890 SOUTH COLORADO BOULEVARD,,DENVER,80222
170851,PO BOX 611,,CONWAY,72033
931329,4855 SEMINOLE DRIVE,,SAN DIEGO,92115
158131,PO Box 1362,,Glenrock,826371362
1467548,1 RUGGED ROAD,,NANTUCKET,2554


<br>Parse another variable and then show some descriptives.

In [39]:
%%time
df['F9_00_HD_FILER_COUNTRY_FRGN'] = df['F9_00_HD_FILER_COUNTRY_FRGN'][:].apply(func, key1='Country', key2='CountryCd')

Wall time: 26.1 s


In [40]:
%%time
df[['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US', 'F9_00_HD_FILER_ZIP_US',
   'F9_00_HD_FILER_COUNTRY_FRGN']].describe().T

Wall time: 32.9 s


Unnamed: 0,count,unique,top,freq
F9_00_HD_FILER_ADDR_US_L1,2102625,433778,2335 NORTH BANK DRIVE,1756
F9_00_HD_FILER_ADDR_US_L2,40287,4344,Suite,13463
F9_00_HD_FILER_CITY_US,2102625,25043,NEW YORK,39579
F9_00_HD_FILER_ZIP_US,2102625,77180,20036,5860
F9_00_HD_FILER_COUNTRY_FRGN,1810,73,CA,540


<br>Here I want to double-check that the parsing was correct for *F9_00_HD_FILER_COUNTRY_FRGN*. Few observations in the dataset have a value for this variable, so rather than sampling from all observations, we use the ``notnull()`` method to draw a random sample of five observations that actually have a value for it.

In [41]:
%%time
df[df['F9_00_HD_FILER_COUNTRY_FRGN'].notnull()][['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US', 'F9_00_HD_FILER_ZIP_US',
   'F9_00_HD_FILER_COUNTRY_FRGN']].sample(5)

Wall time: 451 ms


Unnamed: 0,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN
1409240,,,,,CA
1802579,,,,,CA
872798,,,,,CA
178394,,,,,IS
1254368,,,,,CA


In [42]:
%%time
df[['URL', 'F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US',
    'F9_00_HD_FILER_ZIP_US', 'F9_00_HD_FILER_COUNTRY_FRGN']].to_pickle('efile address variables.pkl')

Wall time: 5.21 s


<br>Process the variable for state. As with the four variables parsed above, this is another 'two-step' parsing process.

In [43]:
%%time
df['F9_00_HD_FILER_STATE_US'] = df['Filer'][:].apply(func_onekey, key1='USAddress')

Wall time: 9.41 s


In [44]:
%%time
df['F9_00_HD_FILER_STATE_US'] = df['F9_00_HD_FILER_STATE_US'][:].apply(func, key1='State', key2='StateAbbreviationCd')

Wall time: 9.94 s


In [45]:
%%time
print(len(df[df['Filer'].notnull()]))
print(len(df[df['F9_00_HD_FILER_STATE_US'].notnull()]))

2104435
2102625
Wall time: 57.8 s


In [123]:
2104435-2102625

1810

<br>The code block above shows that there are 1,810 observations with a value for *Filer* that do not have a value for US state. This matches the count of 1,810 observations with a value for *F9_00_HD_FILER_COUNTRY_FRGN* in the descriptives above. To check what is going on here, I will run the following code block to show the *Filer* column for a random sample of five observations that have a value for *Filer* but are missing the state variable. As you can see, the sampled observations all have a foreign address, so we are in good shape.

In [46]:
df[(df['Filer'].notnull())&(df['F9_00_HD_FILER_STATE_US'].isnull())][['Filer']].sample(5)

Unnamed: 0,Filer
950575,"{'EIN': '980506316', 'BusinessName': {'BusinessNameLine1Txt': 'GSM ASSOCIATION'}, 'InCareOfNm': '% OONAGH STEIN', 'BusinessNameControlTxt': 'GSMA', 'PhoneNum': '2073560600', 'ForeignAddress': {'AddressLine1Txt': 'FLOOR 2 WALBROOK BLDG 25 WALBROOK..."
159777,"{'EIN': '980160122', 'Name': {'BusinessNameLine1': 'Eshel-the Assn for the Planning & Development', 'BusinessNameLine2': 'of Services for the Aged in Israel'}, 'InCareOfName': '% ELIYAHU EREZ', 'NameControl': 'ESHE', 'Phone': '2126876200', 'Forei..."
420180,"{'EIN': '980437032', 'BusinessName': {'BusinessNameLine1': 'CANADIAN LUNG ASSOCIATION'}, 'BusinessNameControlTxt': 'CANA', 'PhoneNum': '6135696411', 'ForeignAddress': {'AddressLine1': '1750 Courtwood Crescent', 'AddressLine2': 'Suite 300', 'City'..."
1128020,"{'EIN': '391522897', 'BusinessName': {'BusinessNameLine1Txt': 'THE BRITISH NORTH AMERICA', 'BusinessNameLine2Txt': 'PHILATELIC SOCIETY LTD'}, 'BusinessNameControlTxt': 'BRIT', 'PhoneNum': '4104422040', 'ForeignAddress': {'AddressLine1Txt': '15 BR..."
1163810,"{'EIN': '981253233', 'BusinessName': {'BusinessNameLine1Txt': 'MUSEUM KAMPA - THE JAN AND MEDA', 'BusinessNameLine2Txt': 'MLADEK FOUNDATION'}, 'BusinessNameControlTxt': 'MUSE', 'PhoneNum': '3017188920', 'ForeignAddress': {'AddressLine1Txt': 'U SO..."
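A sample of five is only a spot check. As a rough sketch of a fuller check, one could count how many of the rows with a missing state actually contain a ``ForeignAddress`` key. The toy data below is made up; on the real data the same two lines would be run against ``df`` while the *Filer* column still exists.

```python
import pandas as pd

toy = pd.DataFrame({
    'Filer': [
        {'EIN': '1', 'USAddress': {'State': 'OH'}},
        {'EIN': '2', 'ForeignAddress': {'AddressLine1': '1750 Courtwood Crescent'}},
        {'EIN': '3', 'ForeignAddress': {'AddressLine1Txt': 'FLOOR 2'}},
    ],
    'F9_00_HD_FILER_STATE_US': ['OH', None, None],
})

# Rows that have a Filer dictionary but no parsed US state.
missing_state = toy[toy['Filer'].notnull() & toy['F9_00_HD_FILER_STATE_US'].isnull()]

# For each of those rows, is there a ForeignAddress key in the dictionary?
has_foreign = missing_state['Filer'].apply(lambda d: 'ForeignAddress' in d)
print(len(missing_state), int(has_foreign.sum()))
```

If the two printed numbers match, every observation without a US state has a foreign address.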


<br>Now show a random sample of the state variable for five observations.

In [47]:
df[['F9_00_HD_FILER_STATE_US']].sample(5)

Unnamed: 0,F9_00_HD_FILER_STATE_US
1194405,MA
1919918,NM
1861903,OR
935191,TX
773352,DC


<br>For further verification we can also check a sample of observations that have a value for *F9_00_HD_FILER_COUNTRY_FRGN*. All of the US variables are empty for these five observations so, again, we are in good shape.

In [49]:
df[df['F9_00_HD_FILER_COUNTRY_FRGN'].notnull()][['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US', 'F9_00_HD_FILER_STATE_US',
    'F9_00_HD_FILER_ZIP_US', 'F9_00_HD_FILER_COUNTRY_FRGN']].sample(5)

Unnamed: 0,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_STATE_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN
790731,,,,,,CA
113838,,,,,,CA
1287652,,,,,,CA
627315,,,,,,CA
267668,,,,,,IS


In [50]:
df[['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US', 'F9_00_HD_FILER_STATE_US',
    'F9_00_HD_FILER_ZIP_US', 'F9_00_HD_FILER_COUNTRY_FRGN']].sample(5)

Unnamed: 0,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_STATE_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN
296124,10506 SW 184 TERRACE,,Miami,FL,331576760,
1805838,100 LANCASTER AVENUE,,WYNNEWOOD,PA,19096,
903582,P O BOX 889,,SANTA FE,TX,775100889,
386285,PO BOX 176,,FARMINGDALE,NY,11735,
1176824,202 CANAL STREET SUITE 500,,NEW YORK,NY,10013,


<br>Now let's drop *F9_00_HD_FILER_STATE_US* and other 'Filer' variables from *new_variables_df* because they are now dealt with above.

In [59]:
new_variables_df[:6]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
0,F9_00_HD_FILER_ADDR_US_L1,[Filer],[USAddress],StreetAddressType,,1,1
1,F9_00_HD_FILER_ADDR_US_L2,[Filer],[USAddress],StreetAddressType,,1,1
2,F9_00_HD_FILER_CITY_US,[Filer],[USAddress],CityType,,1,1
3,F9_00_HD_FILER_COUNTRY_FRGN,[Filer],[ForeignAddress],CountryType,,1,1
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,,1,1
5,F9_00_HD_FILER_ZIP_US,[Filer],[USAddress],ZIPCodeType,,1,1


<br>The next line drops the row in *new_variables_df* that has an index value of 1 (the second row above).

In [60]:
new_variables_df = new_variables_df.drop(1) 
new_variables_df[:6]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
0,F9_00_HD_FILER_ADDR_US_L1,[Filer],[USAddress],StreetAddressType,,1,1
2,F9_00_HD_FILER_CITY_US,[Filer],[USAddress],CityType,,1,1
3,F9_00_HD_FILER_COUNTRY_FRGN,[Filer],[ForeignAddress],CountryType,,1,1
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,,1,1
5,F9_00_HD_FILER_ZIP_US,[Filer],[USAddress],ZIPCodeType,,1,1
6,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[SignatureDt, DateSigned]",DateType,,2,2


<br>In the next two code blocks we'll drop the rows with index values 0, 2, 3, 4, and 5, which are the first, third, fourth, fifth, and sixth rows of the original *new_variables_df* (recall that Python starts counting at 0).

In [61]:
new_variables_df = new_variables_df.drop(0) 
new_variables_df = new_variables_df.drop(2) 
new_variables_df = new_variables_df.drop(3) 
new_variables_df[:2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
4,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,,1,1
5,F9_00_HD_FILER_ZIP_US,[Filer],[USAddress],ZIPCodeType,,1,1


In [62]:
new_variables_df = new_variables_df.drop(4) 
new_variables_df = new_variables_df.drop(5) 
new_variables_df[:2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
6,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[SignatureDt, DateSigned]",DateType,,2,2
7,F9_03_PC_PROG_SVC_ACC_2_CODE,"[Activity2, ProgSrvcAccomActy2Grp]",[ActivityCode],IntegerNNType,,2,1
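The row-by-row drops above could also be written more compactly. The sketch below uses a small made-up table with the same column names and shows two options: passing a list of index labels to one ``drop()`` call, or filtering on the content of *original_names* so that no labels have to be typed by hand.

```python
import pandas as pd

toy_vars = pd.DataFrame({
    'variable_name_new': ['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2',
                          'F9_00_HD_FILER_CITY_US', 'F9_00_HD_SIGNING_OFFICER_SIGNTR'],
    'original_names': [['Filer'], ['Filer'], ['Filer'], ['BusinessOfficerGrp', 'Officer']],
})

# Option 1: pass a list of index labels to a single drop() call.
by_label = toy_vars.drop([0, 1, 2])

# Option 2: keep only the rows whose original column is not 'Filer'.
is_filer = toy_vars['original_names'].apply(lambda names: names == ['Filer'])
by_content = toy_vars[~is_filer]

print(by_label.index.tolist(), by_content.index.tolist())
```

Both options leave the same rows; the second is less likely to break if the index labels change.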


#### Also drop *Filer* from our PANDAS dataset

In [63]:
df = df.drop('Filer', axis=1)

### Loop over variables with a single sub-key
Now we'll proceed to a more efficient looping process. First, we will loop over the three variables that have a single sub-key and apply our 'one key' function to each of them in turn.

In [64]:
new_variables_df[new_variables_df['len_subkeys']!=2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
7,F9_03_PC_PROG_SVC_ACC_2_CODE,"[Activity2, ProgSrvcAccomActy2Grp]",[ActivityCode],IntegerNNType,,2,1
12,F9_03_PC_PROG_SVC_ACC_3_CODE,"[ProgSrvcAccomActy3Grp, Activity3]",[ActivityCode],IntegerNNType,,2,1
98,F9_10_PC_LOANS_FROM_OFFICERS_EOY,"[LoansFromOfficersDirectorsGrp, LoansFromOfficersDirectors]",[EOYAmt],USAmountType,,2,1


<br>In the following loop we will loop over each of the three above variables in *new_variables_df* and, taking the variable name and associated sub-key from *new_variables_df*, we will apply our custom ``func_onekey`` function to that variable in our e-filing dataset. 

In [65]:
%%time
for index, row in new_variables_df[new_variables_df['len_subkeys']!=2].iterrows():
    variable = row['variable_name_new']
    keys = row['sub_keys']
    key = keys[0]    # each of these variables has exactly one sub-key
    print(variable, key)
    # replace the nested dictionary with the value stored under the sub-key
    df[variable] = df[variable].apply(func_onekey, key1=key)

F9_03_PC_PROG_SVC_ACC_2_CODE ActivityCode
F9_03_PC_PROG_SVC_ACC_3_CODE ActivityCode
F9_10_PC_LOANS_FROM_OFFICERS_EOY EOYAmt
Wall time: 49.1 s


<br>We can print out frequencies for the first two variables to verify that the above function worked correctly.

In [66]:
df['F9_03_PC_PROG_SVC_ACC_2_CODE'].value_counts()

522130    36
2         35
900099    29
624100    19
611710    19
          ..
821103     1
237310     1
534490     1
624190     1
711190     1
Name: F9_03_PC_PROG_SVC_ACC_2_CODE, Length: 76, dtype: int64

In [67]:
df['F9_03_PC_PROG_SVC_ACC_3_CODE'].value_counts()

900099    25
522130    25
3         22
624100    16
611710    15
624410     8
611600     7
624200     6
611000     6
541900     5
501        4
90099      4
713940     4
62300      3
713990     3
621300     3
561420     3
711110     3
621400     3
453310     2
519100     2
813910     2
624229     2
811000     2
711190     2
561250     2
611110     2
561499     2
541700     1
524298     1
621999     1
924120     1
525100     1
07         1
712100     1
624310     1
531100     1
522100     1
812900     1
813000     1
561700     1
003        1
623000     1
541199     1
237310     1
821103     1
531390     1
812930     1
712110     1
7160       1
541190     1
624190     1
711130     1
485410     1
541710     1
999999     1
100        1
623990     1
Name: F9_03_PC_PROG_SVC_ACC_3_CODE, dtype: int64

<br>Inspect the data type for the above three variables. All three are 'object' (text) type for now.

In [68]:
%%time
df[['F9_10_PC_LOANS_FROM_OFFICERS_EOY', 'F9_03_PC_PROG_SVC_ACC_2_CODE', 'F9_03_PC_PROG_SVC_ACC_3_CODE']].dtypes

F9_10_PC_LOANS_FROM_OFFICERS_EOY    object
F9_03_PC_PROG_SVC_ACC_2_CODE        object
F9_03_PC_PROG_SVC_ACC_3_CODE        object
dtype: object
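Because the parsed values are still text, a later step may want numeric versions of the amount variables. The following is only a sketch on made-up values, using ``pd.to_numeric``; it is not a step this notebook performs. Note that the activity codes include values such as '07' and '003', so converting those to numbers would lose the leading zeros; the sketch leaves them as text.

```python
import pandas as pd

toy = pd.DataFrame({
    'F9_10_PC_LOANS_FROM_OFFICERS_EOY': ['0', '19900', None],
    'F9_03_PC_PROG_SVC_ACC_3_CODE': ['900099', '07', None],
})

# Amounts: convert text to numbers; values that cannot be parsed become NaN.
toy['F9_10_PC_LOANS_FROM_OFFICERS_EOY'] = pd.to_numeric(
    toy['F9_10_PC_LOANS_FROM_OFFICERS_EOY'], errors='coerce')

# The code column is deliberately left untouched (still 'object').
print(toy.dtypes)
```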

<br>Look at a sample of 10 observations for these three variables.

In [83]:
%%time
df[['F9_10_PC_LOANS_FROM_OFFICERS_EOY', 'F9_03_PC_PROG_SVC_ACC_2_CODE', 'F9_03_PC_PROG_SVC_ACC_3_CODE']].sample(10)

Wall time: 185 ms


Unnamed: 0,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_03_PC_PROG_SVC_ACC_2_CODE,F9_03_PC_PROG_SVC_ACC_3_CODE
831764,,,
1859394,,,
804378,0.0,,
565735,19900.0,,
1030038,,,
116002,,,
1203388,,,
1148635,,,
1654884,,,
1517716,,,


<br>Show the three variables in *new_variables_df*. We're using the ``isin()`` function to apply a filter (any rows where the value of *variable_name_new* matches one of the three names in the list).

In [84]:
new_variables_df[new_variables_df['variable_name_new'].isin(['F9_10_PC_LOANS_FROM_OFFICERS_EOY', 'F9_03_PC_PROG_SVC_ACC_2_CODE', 'F9_03_PC_PROG_SVC_ACC_3_CODE'])]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
7,F9_03_PC_PROG_SVC_ACC_2_CODE,"[Activity2, ProgSrvcAccomActy2Grp]",[ActivityCode],IntegerNNType,,2,1
12,F9_03_PC_PROG_SVC_ACC_3_CODE,"[ProgSrvcAccomActy3Grp, Activity3]",[ActivityCode],IntegerNNType,,2,1
98,F9_10_PC_LOANS_FROM_OFFICERS_EOY,"[LoansFromOfficersDirectorsGrp, LoansFromOfficersDirectors]",[EOYAmt],USAmountType,,2,1


<br>Drop *F9_03_PC_PROG_SVC_ACC_2_CODE* and *F9_03_PC_PROG_SVC_ACC_3_CODE* from ``new_variables_df``. They are already dealt with above.

In [86]:
new_variables_df = new_variables_df.drop(7)
new_variables_df = new_variables_df.drop(12)
new_variables_df[:8]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
6,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[SignatureDt, DateSigned]",DateType,,2,2
8,F9_03_PC_PROG_SVC_ACC_2_DESC,"[Activity2, ProgSrvcAccomActy2Grp]","[Desc, Description]",ExplanationType,,2,2
9,F9_03_PC_PROG_SVC_ACC_2_EXP,"[Activity2, ProgSrvcAccomActy2Grp]","[Expense, ExpenseAmt]",USAmountType,,2,2
10,F9_03_PC_PROG_SVC_ACC_2_GRNT,"[Activity2, ProgSrvcAccomActy2Grp]","[Grants, GrantAmt]",USAmountType,,2,2
11,F9_03_PC_PROG_SVC_ACC_2_REV,"[Activity2, ProgSrvcAccomActy2Grp]","[RevenueAmt, Revenue]",USAmountType,,2,2
13,F9_03_PC_PROG_SVC_ACC_3_DESC,"[ProgSrvcAccomActy3Grp, Activity3]","[Desc, Description]",ExplanationType,,2,2
14,F9_03_PC_PROG_SVC_ACC_3_EXP,"[ProgSrvcAccomActy3Grp, Activity3]","[Expense, ExpenseAmt]",USAmountType,,2,2
15,F9_03_PC_PROG_SVC_ACC_3_GRNT,"[ProgSrvcAccomActy3Grp, Activity3]","[Grants, GrantAmt]",USAmountType,,2,2


<br>Drop *F9_10_PC_LOANS_FROM_OFFICERS_EOY* from ``new_variables_df``, which is also dealt with above.

In [89]:
new_variables_df[90:91]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
98,F9_10_PC_LOANS_FROM_OFFICERS_EOY,"[LoansFromOfficersDirectorsGrp, LoansFromOfficersDirectors]",[EOYAmt],USAmountType,,2,1


In [90]:
new_variables_df = new_variables_df.drop(98)
new_variables_df[88:92]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
96,F9_10_PC_INVEST_PROG_RELTD_EOY,"[InvestmentsProgramRelatedGrp, InvestmentsProgramRelated]","[EOY, EOYAmt]",USAmountType,,2,2
97,F9_10_PC_INVEST_PUB_TRADED_EOY,"[InvestmentsPubTradedSecGrp, InvestmentsPubTradedSecurities]","[EOY, EOYAmt]",USAmountType,,2,2
99,F9_10_PC_OTHER_LIABILITIES_EOY,"[OtherLiabilities, OtherLiabilitiesGrp]","[EOY, EOYAmt]",USAmountType,,2,2
100,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,"[RtnEarnEndowmentIncmOthFndsGrp, RetainedEarningsEndowmentEtc]","[EOY, EOYAmt]",USAmountType,,2,2


#### Loop and apply main function
All of the remaining variables in *new_variables_df* have two sub-keys, as verified by the empty dataframe returned by the following line.

In [92]:
new_variables_df[new_variables_df['len_subkeys']!=2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys


<br>I'm pasting in our custom function ``func`` here again to have it handy.

In [94]:
def func(x, key1, key2):
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan

<br>We can see that there are 100 variables with two sub-keys. We will process all 100 in one loop.

In [95]:
print(len(new_variables_df[new_variables_df['len_subkeys']!=2]))
print(len(new_variables_df[(new_variables_df['len_subkeys']==2)]))

0
100
100


In [96]:
new_variables_df[(new_variables_df['len_subkeys']==2)]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
6,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[SignatureDt, DateSigned]",DateType,,2,2
8,F9_03_PC_PROG_SVC_ACC_2_DESC,"[Activity2, ProgSrvcAccomActy2Grp]","[Desc, Description]",ExplanationType,,2,2
9,F9_03_PC_PROG_SVC_ACC_2_EXP,"[Activity2, ProgSrvcAccomActy2Grp]","[Expense, ExpenseAmt]",USAmountType,,2,2
10,F9_03_PC_PROG_SVC_ACC_2_GRNT,"[Activity2, ProgSrvcAccomActy2Grp]","[Grants, GrantAmt]",USAmountType,,2,2
11,F9_03_PC_PROG_SVC_ACC_2_REV,"[Activity2, ProgSrvcAccomActy2Grp]","[RevenueAmt, Revenue]",USAmountType,,2,2
...,...,...,...,...,...,...,...
104,F9_10_PC_SECURE_MORT_NOTES_EOY,"[MortNotesPyblSecuredInvestProp, MortgNotesPyblScrdInvstPropGrp]","[EOY, EOYAmt]",USAmountType,,2,2
105,F9_10_PC_UNSECURED_LOANS_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,,2,2
106,F9_10_PC_UNSECURED_NOTES_BOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[BOYAmt, BOY]",USAmountType,,2,2
107,F9_10_PC_UNSECURED_NOTES_EOY,"[UnsecuredNotesLoansPayableGrp, UnsecuredNotesLoansPayable]","[EOY, EOYAmt]",USAmountType,,2,2


### Main Loop
Now we will loop over all 100 variables that have two sub-keys. For each variable we will apply our custom function ``func``. If you're wondering why there are two keys, recall that each variable, for example ``F9_09_PC_COMP_OFFICERS_TOTAL``, was the result of combining two different XML sections from the e-file data. The two sections differ not only in their own names but also in the names of the variables nested inside them. For this example variable there are two different 'keys', ``Total`` and ``TotalAmt``, so we will take the value from whichever of the two keys is present.

Accordingly, below you will see for each variable a printout of the variable name and the two keys from which we are grabbing data. 

In [None]:
%%time
for index, row in new_variables_df[new_variables_df['len_subkeys']==2].iterrows():
    variable = row['variable_name_new']
    keys = row['sub_keys']
    key1 = keys[0]
    key2 = keys[1]
    print(variable, key1, key2)
    # replace the nested dictionary with the value under key1 or, failing that, key2
    df[variable] = df[variable].apply(func, key1=key1, key2=key2)
    #df[variable] = df[variable].astype('float')

<br>Note that the above will cause an error when we get to the variable *F9_09_EXP_OTH_TOT*. The reason, as you can see below, is that this is the only variable with a *cardinality* value of 'MANY': its values are lists of dictionaries rather than single dictionaries, so ``pd.isnull(x)`` inside ``func`` returns an array of True/False values instead of a single one, and the ``if`` statement cannot evaluate it. So, before running the loop, you may want to simply delete *F9_09_EXP_OTH_TOT* from *new_variables_df*.

In case you're interested, the error you would see is the following: ``ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()``
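The error can be reproduced on a two-item list taken from the kind of value this variable holds. The sketch also includes ``func_skip_lists``, a hypothetical variant of ``func`` (an assumption, not part of this notebook) that returns NaN for anything that is not a dictionary, which would let the loop run without stopping but would silently discard the list values.

```python
import numpy as np
import pandas as pd

many_value = [{'Desc': 'Activities', 'TotalAmt': '3027'},
              {'Desc': 'Volunteer Recruiting', 'TotalAmt': '45'}]

# On a list, pd.isnull() checks each element and returns an array of booleans.
result = pd.isnull(many_value)
print(result)

# Using that array in an `if` statement is what raises the ValueError.
raised = False
try:
    if result:
        pass
except ValueError as err:
    raised = True
    print('ValueError:', err)

# Hypothetical variant of func: anything that is not a dict (NaN, list, ...) gives NaN.
def func_skip_lists(x, key1, key2):
    if not isinstance(x, dict):
        return np.nan
    if key1 in x:
        return x[key1]
    if key2 in x:
        return x[key2]
    return np.nan

print(func_skip_lists(many_value, 'Total', 'TotalAmt'))
print(func_skip_lists({'TotalAmt': '3027'}, 'Total', 'TotalAmt'))
```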

In [99]:
new_variables_df[new_variables_df['len_subkeys']==2][27:28]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,cardinality,len,len_subkeys
35,F9_09_EXP_OTH_TOT,"[OtherExpensesGrp, OtherExpenses]","[Total, TotalAmt]",USAmountType,MANY,2,2


#### The Fix: Drop *F9_09_EXP_OTH_TOT* from *new_variables_df* and *df*

In [100]:
new_variables_df = new_variables_df.drop(35)

<br>Alternatively, you can skip this variable by running our loop twice. First, modify the ``for`` line of the loop so that it stops before the 28th element (the problem variable):

``for index, row in new_variables_df[new_variables_df['len_subkeys']==2][:27].iterrows():``

and then run it again starting at the 29th element:

``for index, row in new_variables_df[new_variables_df['len_subkeys']==2][28:].iterrows():``

The easiest solution, though, is to simply delete *F9_09_EXP_OTH_TOT* from *new_variables_df* and *df* and then run the main loop above. There are few research cases where you might want this variable.
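Another way to leave this variable out, sketched here on a small made-up table, is to filter on the *cardinality* column rather than on row positions. This assumes, as in the output above, that *cardinality* is either missing or 'MANY'.

```python
import numpy as np
import pandas as pd

toy_vars = pd.DataFrame({
    'variable_name_new': ['F9_09_EXP_OFFICE_TOT', 'F9_09_EXP_OTH_TOT', 'F9_09_EXP_ROY_TOT'],
    'cardinality': [np.nan, 'MANY', np.nan],
    'len_subkeys': [2, 2, 2],
})

# Keep two-sub-key variables whose cardinality is not 'MANY'.
# A missing cardinality compares as "not equal", so those rows are kept.
loop_rows = toy_vars[(toy_vars['len_subkeys'] == 2) & (toy_vars['cardinality'] != 'MANY')]
print(loop_rows['variable_name_new'].tolist())
```

The main loop could then iterate over ``loop_rows.iterrows()`` without hard-coding the position of the problem variable.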

<br>Here we can see a sample of five rows for *F9_09_EXP_OTH_TOT* and *F9_09_EXP_OTH_OTH_TOT*.

In [103]:
df[['F9_09_EXP_OTH_TOT', 'F9_09_EXP_OTH_OTH_TOT']].sample(5)

Unnamed: 0,F9_09_EXP_OTH_TOT,F9_09_EXP_OTH_OTH_TOT
1840359,"[{'Desc': 'Activities', 'TotalAmt': '3027', 'ProgramServicesAmt': '3027'}, {'Desc': 'Volunteer Recruiting', 'TotalAmt': '45', 'ProgramServicesAmt': '45'}, {'Desc': 'Bank and Processing Fees', 'TotalAmt': '6238', 'ManagementAndGeneralAmt': '4497',...",28247
807671,"[{'Desc': 'ACTIVITIES EXPENSE', 'TotalAmt': '3137', 'ProgramServicesAmt': '3137'}, {'Desc': 'MEMBERSHIP EXPENSE', 'TotalAmt': '2543', 'ProgramServicesAmt': '2543'}, {'Desc': 'POSTAGE & PRINTING', 'TotalAmt': '1629', 'ProgramServicesAmt': '1629'},...",1511
353620,"[{'Description': 'PURCHASED SERVICES', 'Total': '1499421', 'ProgramServices': '1407841', 'ManagementAndGeneral': '91489', 'Fundraising': '91'}, {'Description': 'EMPLOYEE TRAINING', 'Total': '22521', 'ProgramServices': '21757', 'ManagementAndGener...",7667
289806,"[{'Description': 'REPAIRS & MAINTENANCE', 'Total': '44318', 'ProgramServices': '37504', 'ManagementAndGeneral': '6814'}, {'Description': 'TRUCK & EQUIPMENT EXPEN', 'Total': '5703', 'ProgramServices': '5703'}, {'Description': 'SUMMER DISCOVERY EXP...",10596
1170319,"[{'Desc': 'MISCELLANEOUS', 'TotalAmt': '38330', 'ProgramServicesAmt': '7322', 'ManagementAndGeneralAmt': '27070', 'FundraisingAmt': '3938'}, {'Desc': 'MEDICAL SUPPLIES', 'TotalAmt': '21445', 'ProgramServicesAmt': '21445', 'ManagementAndGeneralAmt...",5734
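For the few research cases where this variable is wanted, one possible approach is to add up the item totals in each list before the column is dropped. This is only a sketch: ``total_other_expenses`` is a hypothetical helper, and it assumes each item stores its total under ``TotalAmt`` or ``Total``, as in the sample above. Handling a lone dictionary as a one-item list is also an assumption.

```python
def total_other_expenses(items):
    """Sum the 'TotalAmt' / 'Total' values across a list of expense items."""
    if isinstance(items, dict):      # assumption: a lone item may be stored as a dict
        items = [items]
    if not isinstance(items, list):  # missing values and anything unexpected
        return float('nan')
    total = 0.0
    for item in items:
        amount = item.get('TotalAmt', item.get('Total'))
        if amount is not None:
            total += float(amount)
    return total

newer = [{'Desc': 'Activities', 'TotalAmt': '3027', 'ProgramServicesAmt': '3027'},
         {'Desc': 'Volunteer Recruiting', 'TotalAmt': '45', 'ProgramServicesAmt': '45'}]
older = [{'Description': 'REPAIRS & MAINTENANCE', 'Total': '44318'},
         {'Description': 'TRUCK & EQUIPMENT EXPEN', 'Total': '5703'}]

print(total_other_expenses(newer))
print(total_other_expenses(older))
```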


<br>Let's drop *F9_09_EXP_OTH_TOT* from our PANDAS dataset.

In [104]:
df = df.drop('F9_09_EXP_OTH_TOT', axis=1)

<br>Create a list of all 99 (now) sub-key variables.

In [105]:
subkey_vars = new_variables_df[new_variables_df['sub_keys'].notnull()]['variable_name_new'].tolist()
print(len(subkey_vars))

99


<br>Show a sample of ten rows for the 99 variables. As you can see, none is still 'nested': the variable transformations in our main loop above have worked.

In [106]:
print(len(df.columns), len(df))
print(len(df[subkey_vars].columns), len(df))
df[subkey_vars].sample(10)

293 2104435
99 2104435


Unnamed: 0,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_03_PC_PROG_SVC_ACC_2_DESC,F9_03_PC_PROG_SVC_ACC_2_EXP,F9_03_PC_PROG_SVC_ACC_2_GRNT,F9_03_PC_PROG_SVC_ACC_2_REV,F9_03_PC_PROG_SVC_ACC_3_DESC,F9_03_PC_PROG_SVC_ACC_3_EXP,F9_03_PC_PROG_SVC_ACC_3_GRNT,F9_03_PC_PROG_SVC_ACC_3_REV,F9_08_PC_TOTAL_REVENUE,F9_09_EXP_AD_PROMO_TOT,F9_09_EXP_BENF_PAID_MEMB_TOT,F9_09_EXP_CONF_MEETING_TOT,F9_09_EXP_DEPREC_FUNDR,F9_09_EXP_DEPREC_MAG,F9_09_EXP_DEPREC_PROG,F9_09_EXP_DEPREC_TOT,F9_09_EXP_GRANT_FRGN_TOT,F9_09_EXP_GRANT_INDIV_DMSTC_TOT,F9_09_EXP_GRANT_ORG_DMSTC_TOT,F9_09_EXP_INFO_TECH_TOT,F9_09_EXP_INSURANCE_TOT,F9_09_EXP_INTEREST_TOT,F9_09_EXP_JOINT_COSTS_TOT,F9_09_EXP_OCCUPANCY_TOT,F9_09_EXP_OFFICE_TOT,F9_09_EXP_OTH_OTH_TOT,F9_09_EXP_ROY_TOT,F9_09_EXP_TRAVEL_ENTRTNMNT_TOT,F9_09_EXP_TRAVEL_TOT,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYMENT_TO_AFFILIATES,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_ASSETS_ACC_NET_EOY,F9_10_ASSETS_EXP_PREPAID_EOY,F9_10_ASSETS_INTANGIB_EOY,F9_10_ASSETS_INVENT_SALE_EOY,F9_10_ASSETS_LESS_DEPREC_EOY,F9_10_ASSETS_LOANS_DISQUAL_EOY,F9_10_ASSETS_NOTES_LOANS_NET_EOY,F9_10_ASSETS_OTH_EOY,F9_10_ASSETS_PLEDGES_NET_EOY,F9_10_LIAB_ACC_PAYABLE_EOY,F9_10_LIAB_GRANTS_PAYABLE_EOY,F9_10_LIAB_LOANS_OFF_EOY,F9_10_LIAB_REV_DEFERRED_EOY,F9_10_NAFB_RESTRICT_PERM_EOY,F9_10_NAFB_RESTRICT_TEMP_EOY,F9_10_NAFB_UNRESTRICT_EOY,F9_10_PC_BOND_LIABILITY_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_ESCROW_LIABILITY_EOY,F9_10_PC_INVEST_OTHER_SEC_EOY,F9_10_PC_INVEST_PROG_RELTD_EOY,F9_10_PC_INVEST_PUB_TRADED_EOY,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_SECURE_MORT_NOTES_EOY,F9_10_PC_UNSECURED_LOANS_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY
211925,2013-05-30,,,,,,,,,39736,0.0,0.0,0.0,,2547.0,,2547.0,0.0,0.0,0.0,0.0,0.0,29.0,,17199.0,5798.0,32931.0,0.0,0.0,151.0,,,,0.0,,,,0.0,705.0,,0.0,0.0,0.0,0.0,4215.0,,,,0.0,,,,0.0,0.0,,,,0.0,,,,0.0,67300,0,11696,55604,2885.0,,,,345.0,,,,,,,,,,,3545.0,,4494.0,350.0,,,,,35.0,,,,,,,,,3580
167903,2012-10-25,POLICY DEVELOPMENT - THE DEVELOPMENT AND DISSEMINATION OF BROADER POLICIES TO TIE TOGETHER RECOMMENDATIONS FROM MULTIPLE SPECIFIC RESEARCH AND POLICY PROJECTS; INCREASING LINKAGES AND ALLIANCES WITH OTHER ORGANIZATIONS AROUND POLICIES TARGETED AT...,235413.0,,,"PUBLIC EDUCATION - COMMUNICATION OF EPI'S RESEARCH FINDINGS AND POLICY RECOMMENDATIONS THROUGH A BROAD RANGE OF CHANNELS INCLUDING PUBLICATIONS, MEDIA, PUBLIC EVENTS, WEBSITES AND ONLINE FORUMS IN ORDER TO INCREASE PUBLIC AWARENESS OF THE INSTITU...",104047.0,,,5844652,,,174878.0,10346.0,12932.0,92898.0,116176.0,,,136320.0,70812.0,4654.0,,,813771.0,,107912.0,,,87387.0,,,,,136913.0,11008.0,983287.0,1131208.0,,,,2600.0,,,130696.0,20505.0,21144.0,210035.0,251684.0,188818.0,263341.0,1944373.0,2396532.0,,20630.0,21275.0,211329.0,253234.0,24011.0,24760.0,245955.0,294726.0,6727718,498853,608598,5620267,47284.0,59551.0,,,250083.0,,,,1717278.0,434205.0,,,,,3458878.0,1543566.0,,302.0,302.0,,,,,354102.0,,4285846.0,3716253.0,,,,,,5790751
884364,2016-08-17,,,,,,,,,282612,,,4086.0,,,,,,,,,,,,,1104.0,202093.0,,,877.0,,,,,0.0,0.0,0.0,0.0,,,,,,,,,,,,0.0,0.0,20577.0,20577.0,,0.0,0.0,1575.0,1575.0,,,,,263617,0,9424,254193,11717.0,,,17871.0,,,,,,2751.0,,,,,,67155.0,,25253.0,40318.0,,,,,,,,,,,,,,69906
316235,2013-04-29,,,,,,,,,461558,0.0,0.0,0.0,,,215.0,215.0,0.0,0.0,902743.0,179.0,0.0,0.0,,0.0,1254.0,0.0,0.0,0.0,2600.0,,,,0.0,,,,0.0,1550.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,,,,0.0,0.0,,,,0.0,,,,0.0,914416,0,179,914237,0.0,0.0,0.0,0.0,1506.0,0.0,0.0,0.0,0.0,,,,,,,243973.0,,453222.0,109370.0,,0.0,0.0,0.0,,,242774.0,133097.0,,,,,,243973
1842557,2020-08-10,In-home family preservation services are provided to help prevent placement or disruption of placements,963418.0,,,Therapeutic services are provided to the children and their families.,1237282.0,,,8424269,0.0,0.0,0.0,,27609.0,25517.0,53126.0,0.0,0.0,0.0,0.0,141358.0,8862.0,,94392.0,242080.0,13.0,0.0,0.0,0.0,,,,0.0,,2703.0,110675.0,113378.0,41545.0,0.0,0.0,3346.0,0.0,0.0,163853.0,,48413.0,184005.0,232418.0,,58888.0,2410778.0,2469666.0,0.0,,37755.0,154706.0,192461.0,,3583.0,29050.0,32633.0,7416468,0,260469,7155999,1636549.0,60708.0,0.0,0.0,597376.0,0.0,0.0,952.0,0.0,562470.0,0.0,0.0,0.0,,,,0.0,156310.0,480371.0,0.0,356257.0,0.0,0.0,0.0,,150037.0,446780.0,173746.0,173746.0,0.0,0.0,0.0,3578993
283598,2013-06-19,,,,,,,,,126710,,,,,,,,3434.0,,8960.0,,,,,54500.0,,8073.0,,,,,,,,,,,,200.0,,,,,,,,,,,,,,,,,,,,,,,,111176,0,868,110308,,,,,,,,36000.0,,,,,,,,,,28336.0,28486.0,,,,,,161911.0,123041.0,138425.0,,,,,,202911
1974044,2020-07-15,,,,,,,,,323147,0.0,0.0,9477.0,,,1049.0,1049.0,0.0,5000.0,0.0,0.0,3810.0,0.0,,12696.0,3341.0,1881.0,0.0,0.0,0.0,,,,0.0,,,,49000.0,3250.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,,,,0.0,208922.0,,,6843.0,6843.0,,,,0.0,329531,0,0,280531,1833.0,1000.0,0.0,0.0,2471.0,0.0,0.0,750.0,0.0,31558.0,,,,,,409477.0,,72320.0,80232.0,,0.0,0.0,0.0,,,353544.0,354749.0,,,,,,441035
1438862,2018-11-13,,,,,,,,,80081,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,159016.0,11445.0,,,3466425.0,,,75244.0,,,,,,,,3686742.0,,,,,,,,25388.0,,,,,,,,,3712130
1517871,2019-05-15,"IMOM PROGRAM PURPOSE:IMOM IS OUR MOTHERHOOD PROGRAM DESIGNED TO HELP MOTHERS GROW WISE, HEALTHY, PURPOSE-MINDED, AND RELATIONALLY-FOCUSED CHILDREN. OUR IMOM RESOURCES, DAILY EMAILS AND IMOM.COM WEBSITE GIVES MOTHERS THE TOOLS THEY NEED TO CONNECT...",273949.0,0.0,0.0,"FAMILY MINUTE PURPOSE:THE FAMILY MINUTE WITH MARK MERRILL IS OUR POPULAR DAILY RADIO FEATURE AIRING ON 370 RADIO STATIONS AND REACHING ABOUT 6,015,311 PEOPLE EACH WEEK WITH PRACTICAL EVERYDAY ADVICE ON MARRIAGE, PARENTING AND FAMILY RELATIONSHIPS...",318394.0,0.0,0.0,4399269,477971.0,,14564.0,2009.0,1637.0,15706.0,19352.0,,,,183601.0,12764.0,,,105139.0,132608.0,6686.0,,,64814.0,,,,,48675.0,45393.0,405043.0,499111.0,10000.0,,,20001.0,,,963951.0,9990.0,9316.0,83127.0,102433.0,104496.0,97451.0,869557.0,1071504.0,,10061.0,9383.0,83725.0,103169.0,,,,,3973964,253948,180981,3539035,316644.0,38000.0,,,22728.0,,,,3805.0,74897.0,,,250000.0,,,1068764.0,,553908.0,1012163.0,,,,,,,231.0,321.0,,,,,,1393661
1523790,2019-05-10,,,,,,,,,86001,,,,,,,,,48250.0,,985.0,1598.0,,,,1408.0,,,,,,,,,,,,,5174.0,,9659.0,,,,,,,,,,,,,,,,,,,,,,80618,1564,17614,61440,,,,,,,,,,,,,,,704386.0,279543.0,,3916.0,12026.0,,,,958714.0,,,16488.0,13189.0,,,,,,983929
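Eyeballing a sample can miss rare cases, so a programmatic check may be a useful complement. The sketch below defines ``nested_columns``, a hypothetical helper (not part of this notebook) that lists any columns still holding a dictionary or list, and runs it on made-up data.

```python
import pandas as pd

toy = pd.DataFrame({
    'parsed_ok': ['24000', None, '47479'],
    'still_nested': [{'EOYAmt': '47479'}, None, '100'],
})

def nested_columns(frame, columns):
    """Return the columns that still hold at least one dict or list value."""
    return [col for col in columns
            if frame[col].map(lambda v: isinstance(v, (dict, list))).any()]

print(nested_columns(toy, ['parsed_ok', 'still_nested']))
```

On the real data the call would be ``nested_columns(df, subkey_vars)``, and an empty list would indicate that every sub-key variable was parsed. With over two million rows and 99 columns this check may take a while to run.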


### Look at 501(c)(3)s

In [115]:
print('# of columns:', len(df.columns))
print('# of observations:', len(df))

# of columns: 293
# of observations: 2104435


In [116]:
df['501c3'].value_counts()

1    1610772
0     493663
Name: 501c3, dtype: int64

In [117]:
print(len(df[df['501c3']==1]))

1610772


#### Create and save list of EINs for BMF File

In [118]:
ein_list_2022 = df[df['501c3']==1]['EIN'].tolist()
print(len(ein_list_2022))
print(len(set(ein_list_2022)))
ein_list = list(set(ein_list_2022))
print(len(ein_list_2022))

1610772
275192
1610772


In [119]:
import json
with open('ein_list_501c3.json', 'w') as fp:
    json.dump(ein_list_2022, fp)
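Note that the list written above is ``ein_list_2022``, which still contains one entry per filing (1,610,772 entries for 275,192 distinct EINs). If a list of distinct EINs is what is needed, something along the following lines might be used instead. This is a sketch on made-up values; ``dict.fromkeys()`` is used here rather than ``set()`` only because it keeps the EINs in their original order.

```python
import json

toy_eins = ['203840246', '980506316', '203840246']   # toy values; one EIN appears twice

# dict.fromkeys() drops repeats while keeping first-seen order.
unique_eins = list(dict.fromkeys(toy_eins))
print(len(toy_eins), len(unique_eins))

# Serialize the de-duplicated list; json.dump(unique_eins, fp) would write it to a file.
payload = json.dumps(unique_eins)
```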

#### Save DF

In [120]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('all filings August 2022 - all control variables (with parsed sub-key variables).pkl.gz', compression='gzip')

Current date and time :  2022-08-16 20:05:24 

Wall time: 36min 9s
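To pick the data up again in a later session, the gzipped pickle can be read back with ``pd.read_pickle``. The sketch below demonstrates the round trip on a tiny made-up dataframe written to a temporary folder; the file name ``toy.pkl.gz`` is purely illustrative.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'EIN': ['203840246'], 'F9_10_PC_UNSECURED_NOTES_BOY': ['24000']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl.gz')
    toy.to_pickle(path, compression='gzip')
    # compression defaults to 'infer', so the .gz extension is enough on read.
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```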
