# Overview

*Main purpose of notebook:* I created versions of the data with the null values filled with zeros. 

- Read in *concordance_VERIFIED.xlsx* in order to access the *fill_null* column
    - Collapse to *new_variables_df* then use that DF
    - Note that data verifications are done at the beginning of this notebook; specifically, I looked at descriptives for all variables to see which ones had null values that can be filled with zeros. For most, if not all, of the 'excluded' variables (such as date variables and 501c 'type' variables), the decision is obvious.
    - Based on the analyses, I then saved a new column in *concordance_VERIFIED.xlsx* called 'fill_null' (column was filled out in Excel)

- Read in DF: 
    - *all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types).pkl*
    
- Identify columns with missing data:
    - missingcols = list(df.columns[df.isnull().any()])
    - the above list is then refined to exclude columns where *fill_null* = 'Do not fill null'
    - Write function to fill null values and then loop over *missingcols* and apply

- Fixed one variable:
    - *F9_00_HD_SPECIAL_CONDITION_DESC* had each line as a different list, so I combined them into one text block.
    
- Fixed *OrganizationName* for one row

- File saved *with* null values filled:
    - *all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types and fillnull).pkl*
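
The *F9_00_HD_SPECIAL_CONDITION_DESC* fix listed above can be sketched as follows. This is only an illustration, assuming each affected cell holds a Python list of text lines; `join_lines` and the toy Series are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

def join_lines(value, sep=' '):
    """Join a list of text lines into one string; return other values unchanged."""
    if isinstance(value, list):
        return sep.join(str(line) for line in value)
    return value

# Toy stand-in for the column: a list of lines, a null, and plain text.
toy = pd.Series([['FIRST LINE', 'SECOND LINE'], np.nan, 'ALREADY TEXT'])
combined = toy.apply(join_lines)
print(combined.tolist())
```

Nulls and values that are already plain strings pass through untouched, so the step can be re-run safely.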
    

Notes:
- I no longer fix *problem_cols* (e.g., *F9_00_HD_EXEMPT_STATUS_501C*, *F9_12_PC_ACCTG_METHOD_OTHER*) or change any data types in this notebook


To Do:
- Decide when to limit to 501c3s (probably a later notebook)

# Load Packages

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

1.0.1


In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [4]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read 990 DB into PANDAS DF
Read the pickled file containing all filings into a pandas DataFrame.

In [5]:
%%time
df = pd.read_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types).pkl')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

# of columns: 202
# of observations: 1895016
Wall time: 33.2 s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US,ein_int,F9_00_HD_TIME_STAMP_yr
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13+00:00,1,,,,,1.0,,,1473903,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09 06:41:09-06:00,,1.0,,,,,1992.0,0.0,1439340,1044925.0,638637.0,10,30447.0,1753405.0,243131.0,0.0,0,0.0,0.0,89152,193604.0,,2440859,881768.0,195892,0,0.0,450430,1075372.0,0,0.0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0.0,1925215.0,1384751,171810.0,1473903,0,0,0,1,1.0,0.0,0,1,0,0,0,0,0,0,,1.0,0,,0,0,0,1.0,1,1.0,10,10,0,0.0,,,,"[PA, NJ, DE]",0,0,0,1.0,0.0,0.0,0,0.0,0.0,0.0,1439340.0,,,,,,,,,,,,,,,1439340.0,1000.0,,1473903,,,,,,,,,21675.0,,215.0,,,,,,,,,,,,,,,,,,,,1384751,195892.0,145115.0,1043744.0,,,,256845.0,86228.0,,1.0,,240077.0,,332660.0,270700.0,,,,2440859,,,,89152.0,,0,1.0,,,1.0,,0.0,1,1,PA,232705170,2011


### Inspect Data to See Which Columns Can Be Filled -- Not Needed on the Next Run -- Instead, Use the Updated *concordance* File to Identify Columns That Should Be Excluded

In [40]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [41]:
%%time 
df[df.columns.tolist()[:35]].describe(percentiles=[]).T

Wall time: 2.84 s


Unnamed: 0,count,mean,std,min,50%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,435088,19128,244257,-35000,0,32764282
F9_00_HD_ADDR_CHANGE,74526,1,0,1,1,1
F9_00_HD_AMENDED_RETURN,16642,1,0,1,1,1
F9_00_HD_EXEMPT_STATUS_4847A1,1514,1,0,1,1,1
F9_00_HD_EXEMPT_STATUS_501C,486371,7,4,2,6,29
F9_00_HD_EXEMPT_STATUS_501C3,1407131,1,0,1,1,1
F9_00_HD_FINAL_RETURN,10123,1,0,1,1,1
F9_00_HD_GROSS_RCPT,1895016,16053257,519366722,0,580572,310516974055
F9_00_HD_GROUP_RETURN,1895016,0,0,0,0,1
F9_00_HD_INCLUDES_SUBORD_ORGS,297904,0,0,0,0,1


In [42]:
exclude_cols = ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_YEAR_FORMED']

In [43]:
%%time 
df[df.columns.tolist()[35:60]].describe(percentiles=[]).T

Wall time: 2.4 s


Unnamed: 0,count,mean,std,min,50%,max
F9_01_PC_BEN_PAID_MEMB_PRIOR,881997,1613615,48679367,-189170543,0,8893847473
F9_01_PC_CONTR_GRANTS_CURR,1895016,1830745,27027125,-654611,114306,9265119609
F9_01_PC_CONTR_GRANTS_PRIOR,1636002,2011591,27163890,-900000,159477,9265119609
F9_01_PC_GRANTS_PRIOR,1077233,1123880,17388096,-341071,0,4429165079
F9_01_PC_INDEP_VOTING_MEMB,1895016,19,731,0,8,830201
F9_01_PC_INVEST_INCOME_PRIOR,1594890,465987,16312345,-5148733602,1104,9166085042
F9_01_PC_NET_ASSETS_BOY,1871194,11397256,229042003,-4419198170,515730,61288153211
F9_01_PC_OTHER_EXPENSE_PRIOR,1828459,4672287,117676670,-186053238,239248,50467127024
F9_01_PC_OTHER_REV_PRIOR,1425060,320335,5921829,-651010000,9991,1723980625
F9_01_PC_PROF_FUNDRISING_EXP_CURR,1895016,4408,117524,-35000,0,32764282


In [44]:
%%time 
df[df.columns.tolist()[60:90]].describe(percentiles=[]).T

Wall time: 2.43 s


Unnamed: 0,count,mean,std,min,50%,max
F9_01_PZ_BEN_PAID_TO_MEMB_CURR,1895016,783262,34334848,-189170543,0,8893847473
F9_01_PZ_GRANTS_PAID_CURR,1895016,686255,13904960,-341071,0,4798368744
F9_01_PZ_INVEST_INCOME_CURR,1895016,457764,15664382,-802196000,446,9166085042
F9_01_PZ_NAFB_EOY,1895016,12041141,244661399,-5034822702,529860,61288153211
F9_01_PZ_OTHER_EXPENSE_CURR,1895016,4759825,123210724,-1530406504,239436,54619014197
F9_01_PZ_OTHER_REV_CURR,1895016,251709,5305318,-123236329,1186,1723980625
F9_01_PZ_PROG_SERVICE_REV_CURR,1895016,8198511,161715891,-131946331,96777,58512193717
F9_01_PZ_SALARIES_CURR,1895016,3870871,50569654,-10284019,117947,9300950001
F9_01_PZ_SALARIES_PRIOR,1529943,4550316,53239673,-2179118,199826,8438687895
F9_01_PZ_TOT_ASSETS_BOY,1873181,21544872,377655439,-98344486,777896,90967341073


In [50]:
mgt_outsourcing_cols = ['F9_06_PC_CHANGES_ORGANIZING_DOCS', 'F9_06_PC_DELEGATION_MGT_DUTIES', 
                        'F9_06_PC_DELEGATION_OF_MGT']

In [45]:
%%time 
df[df.columns.tolist()[90:120]].describe(percentiles=[]).T

Wall time: 2.57 s


Unnamed: 0,count,mean,std,min,50%,max
F9_06_PC_JOINT_VENTURE_INVESTMNT,1895016,0,0,0,0,1
F9_06_PC_JOINT_VENTURE_POLICY,99890,0,0,0,0,1
F9_06_PC_LOCAL_CHAPTERS,1895016,0,0,0,0,1
F9_06_PC_MATERIAL_DIVERSION,1895016,0,0,0,0,1
F9_06_PC_MEMBERS_OR_STOCKHOLDERS,1895016,0,0,0,0,1
F9_06_PC_MINUTES_COMMITTEES,1888476,1,0,0,1,1
F9_06_PC_MINUTES_GOVERNING_BODY,1895016,1,0,0,1,1
F9_06_PC_MONITORING_OF_COI_POLICY,1325335,1,0,0,1,1
F9_06_PC_NUM_IND_VOTING_MEMBERS,1895016,19,731,0,8,830201
F9_06_PC_NUM_VOTING_GOV_MEMBERS,1895016,20,433,0,9,447339


In [46]:
%%time 
df[df.columns.tolist()[120:150]].describe().T

Wall time: 3.01 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_08_PC_FUNDRAISING_DIRECT_EXP,517853,77883,459695,-110359,6139,21682,62675,51988787
F9_08_PC_FUNDRAISING_EVENTS,333970,225731,3006363,-112579,8265,37394,130017,487001193
F9_08_PC_FUNDRAISING_GROSS_INC,538184,101850,464484,-435270,10564,35375,95350,51988787
F9_08_PC_GAMING_DIRECT_EXPENSES,105402,246072,948035,-2096,0,7000,123146,38353740
F9_08_PC_GAMING_GROSS_INCOME,110711,278607,1182695,-3933,0,18813,162776,65651549
F9_08_PC_GOVERNMENT_GRANTS,526124,2587263,29364781,-386453,42167,199338,886769,7239962734
F9_08_PC_GROSS_SALES_INVENTORY,272555,708935,8533560,-626702,4371,42429,215944,1277677982
F9_08_PC_MEMBERSHIP_DUES,332392,271181,2617505,-350967,3505,23932,128850,357505219
F9_08_PC_NONCASH_CONTRIBUTIONS,310269,1320192,28081169,-464626,7200,40238,195616,6383613288
F9_08_PC_PROGRAM_SVCE_REV_TOTAL,1417558,10959906,186896297,-131946331,47747,257590,1250926,58512193717


In [54]:
%%time 
df[df.columns.tolist()[150:180]].describe().T

Wall time: 3.43 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_09_PC_OTHER_EMP_BEN_MGMT,621369,183848,2954259,-18045287,2774,11441,50032,1223889692
F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,707564,928137,9081174,-21546453,11131,46056,238983,1363191407
F9_09_PC_OTHER_EMP_BEN_TOTAL,1100929,738599,8403664,-21467190,614,25875,150707,1375814505
F9_09_PC_OTHER_SALARY_FUNDRAISE,425298,148277,1090980,-226274,2897,20490,82087,99085531
F9_09_PC_OTHER_SALARY_MGMT,813391,775052,7570925,-12764001,15816,52571,206594,956452259
F9_09_PC_OTHER_SALARY_PROG_SVCE,988007,4668799,47566692,-2018148,64035,219523,1030718,6256732477
F9_09_PC_OTHER_SALARY_TOTAL,1356487,4040135,45491824,-396773,37780,167573,812148,6612941799
F9_09_PC_PAYROLL_TAX_FUNDRAISE,435025,11723,73864,-21772,412,2106,7418,7543216
F9_09_PC_PAYROLL_TAX_MGMT,830851,63862,589234,-1739403,2031,5983,20826,103212918
F9_09_PC_PAYROLL_TAX_PROG_SVCE,960684,339808,3186234,-435425,7190,21587,88770,436201871


In [55]:
%%time 
df[df.columns.tolist()[180:]].describe().T

Wall time: 2.11 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_10_PC_SAVINGS_TEMP_INVEST_EOY,1240363,2141832,36554142,-253206305,19308,132724,533506,12651079908
F9_10_PC_SECURED_MORTGAGES_EOY,472996,4813626,75829871,-5139991,43194,366140,1718780,15858200062
F9_10_PC_UNSECURED_NOTES_BOY,181821,2646104,83794414,-2073331,0,2899,100000,10946624716
F9_10_PC_UNSECURED_NOTES_EOY,183436,2723043,79479847,-3154821,0,3025,100000,10064580236
F9_10_PZ_TOTAL_ASSETS_EOY,1895016,22535062,397105363,-98344486,213736,794108,3395944,90967341073
F9_11_PC_RECNCLTN_DONATED_SVCES,51256,42708,1116134,-73279162,0,0,2050,89834062
F9_11_PC_RECNCLTN_INVSTMNT_EXP,40632,-1751,317760,-16902760,0,0,0,39996125
F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,141358,-50782,14406348,-4440535144,-4412,0,2765,1549249736
F9_11_PC_RECNCLTN_REV_LESS_EXP,1861706,631485,40687595,-2041199044,-24207,10860,98082,50245931778
F9_11_PC_RECNCLTN_UNRLZD_GAIN,401729,528371,35259928,-6503895400,-14198,3091,77616,10357886104


In [56]:
#exclude_cols = exclude_cols +['']
print(exclude_cols)

['F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_YEAR_FORMED']


### Read in *concordance* file to see which columns should not be filled

In [65]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:1]

# of columns: 17
# of observations: 384


Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,string,Do not fill null,,TaxPeriodEndDate,,


In [66]:
def agg_funcs(x):
    # Keep the first value of each attribute within one variable_name_new group
    names = {
        'data_type_xsd': x['data_type_xsd'].head(1).values[0],
        'python_data_type': x['python_data_type'].head(1).values[0],
        'fill_null': x['fill_null'].head(1).values[0],
    }
    # The following shortcut works but changes the order of the columns:
    # return pd.Series(names, index=list(names.keys()))
    return pd.Series(names, index=['data_type_xsd', 'python_data_type', 'fill_null'])

# Collapse the concordance to one row per variable
new_variables_df = concordance.groupby(['variable_name_new']).apply(agg_funcs)
new_variables_df = new_variables_df.reset_index()
print('# of variables:', len(new_variables_df))
new_variables_df[:]

# of variables: 193


Unnamed: 0,variable_name_new,data_type_xsd,python_data_type,fill_null
0,F9_00_HD_ADDR_CHANGE,CheckboxType,Int64,
1,F9_00_HD_AMENDED_RETURN,CheckboxType,Int64,
2,F9_00_HD_BUILD_TIME_STAMP,TimestampType,DateTime,Do not fill null
3,F9_00_HD_CTRY_OF_DOMICILE,CountryType,string,Do not fill null
4,F9_00_HD_EXEMPT_STATUS_4847A1,CheckboxType,Int64,
...,...,...,...,...
188,F9_12_PC_AUDIT_COMMITTEE,BooleanType,Int64,
189,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,BooleanType,Int64,
190,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,BooleanType,Int64,
191,F9_12_PC_FINCL_STMTS_AUDITED,BooleanType,Int64,
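
The collapse above keeps the first row for each `variable_name_new`. Assuming that is the whole intent, a similar result could be sketched with `drop_duplicates`. The toy concordance below is hypothetical and only mimics the relevant columns.

```python
import pandas as pd

# Hypothetical miniature concordance: VAR_A appears under two xpaths.
toy_concordance = pd.DataFrame({
    'variable_name_new': ['VAR_B', 'VAR_A', 'VAR_A'],
    'xpath': ['/b', '/a/v1', '/a/v2'],
    'data_type_xsd': ['DateType', 'CheckboxType', 'CheckboxType'],
    'python_data_type': ['string', 'Int64', 'Int64'],
    'fill_null': ['Do not fill null', None, None],
})

keep = ['variable_name_new', 'data_type_xsd', 'python_data_type', 'fill_null']
collapsed = (
    toy_concordance[keep]
    .drop_duplicates(subset='variable_name_new', keep='first')  # first row per variable
    .sort_values('variable_name_new')                           # match groupby's sorted keys
    .reset_index(drop=True)
)

# Variables flagged in the concordance as not to be filled.
do_not_fill = collapsed.loc[
    collapsed['fill_null'] == 'Do not fill null', 'variable_name_new'
].tolist()
print(collapsed)
print(do_not_fill)
```

This avoids the per-group Python function call, which may matter little for a concordance of a few hundred rows but keeps the step easier to read.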


In [67]:
new_variables_df['fill_null'].value_counts()

Do not fill null    19
Name: fill_null, dtype: int64

In [73]:
print(len(set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist())))

19


In [68]:
string_cols = df.select_dtypes(include='object').columns.tolist()
print(len(string_cols))
print(string_cols, '\n')
#df[string_cols].describe().T

22
['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'fiscal_year', 'Filer', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_FILER_STATE_US', 'F9_00_HD_TIME_STAMP_yr'] 



In [69]:
set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()) - set(string_cols)

{'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_YEAR_FORMED'}

In [70]:
set(string_cols) - set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist())

{'DLN',
 'EIN',
 'F9_00_HD_TIME_STAMP_yr',
 'Filer',
 'OrganizationName',
 'URL',
 'fiscal_year'}

In [74]:
no_fill_cols = list(set(string_cols + new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()))
print(len(no_fill_cols))

26


In [75]:
set(string_cols) - set(no_fill_cols)

set()

In [76]:
set(no_fill_cols) - set(string_cols)

{'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_YEAR_FORMED'}

In [77]:
set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()) - set(no_fill_cols)

set()

In [78]:
set(no_fill_cols) - set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist())

{'DLN',
 'EIN',
 'F9_00_HD_TIME_STAMP_yr',
 'Filer',
 'OrganizationName',
 'URL',
 'fiscal_year'}
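
The pairwise set checks above could be wrapped in one small helper. The sketch below is hypothetical (`compare_lists` is not part of the notebook) and uses short toy lists in place of `string_cols` and the flagged concordance variables.

```python
def compare_lists(left, right):
    """Report which items appear only in `left`, only in `right`, or in both."""
    left_set, right_set = set(left), set(right)
    return {
        'only_left': sorted(left_set - right_set),
        'only_right': sorted(right_set - left_set),
        'both': sorted(left_set & right_set),
    }

# Toy stand-ins for string_cols and the 'Do not fill null' variables.
toy_string_cols = ['EIN', 'URL', 'F9_00_HD_WEBSITE']
toy_flagged = ['F9_00_HD_WEBSITE', 'F9_00_HD_TAX_YEAR']

report = compare_lists(toy_string_cols, toy_flagged)
for key, values in report.items():
    print(key, ':', values)
```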

In [79]:
no_fill_cols

['URL',
 'TaxPeriod',
 'F9_00_HD_YEAR_FORMED',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_FILER_STATE_US',
 'F9_12_PC_ACCTG_METHOD_OTHER',
 'F9_00_HD_TAX_YEAR',
 'OrganizationName',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'DLN',
 'F9_00_HD_WEBSITE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_TIME_STAMP_yr',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'EIN',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'fiscal_year',
 'Filer',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR']

In [88]:
%%time
for col in no_fill_cols:
    print(col, ': ', df[col].isnull().sum())

URL :  1
TaxPeriod :  0
F9_00_HD_YEAR_FORMED :  147078
F9_00_HD_TIME_STAMP :  0
F9_00_HD_FILER_STATE_US :  1578
F9_12_PC_ACCTG_METHOD_OTHER :  1855891
F9_00_HD_TAX_YEAR :  0
OrganizationName :  1
F9_00_HD_GROSS_EXEMPT_NUM :  1829525
DLN :  1
F9_00_HD_WEBSITE :  197092
F9_00_HD_TYPE_ORG_OTHER_DESC :  1860928
F9_00_HD_TIME_STAMP_yr :  0
F9_01_PZ_ORGANIZATIONAL_MISSION :  0
F9_00_HD_EXEMPT_STATUS_501C :  1408645
F9_00_HD_PRIN_OFF_NAME :  303757
F9_00_HD_BUILD_TIME_STAMP :  0
F9_00_HD_CTRY_OF_DOMICILE :  1893995
F9_00_HD_TAX_PER_END :  0
F9_00_HD_SPECIAL_CONDITION_DESC :  1894215
EIN :  0
F9_00_HD_STATE_OF_DOMICILE :  107338
F9_06_PC_STATES_WHERE_RET_FILED :  927884
fiscal_year :  0
Filer :  0
F9_00_HD_SIGNING_OFFICER_SIGNTR :  0
Wall time: 25.6 s


##### Fix one row

In [98]:
#pd.set_option('max_colwidth', 500)

In [99]:
#df[df['DLN'].isnull()][['Filer', '501c3']]

Unnamed: 0,Filer,501c3
1895015,"{'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationCd': 'CA', 'ZIPCd': '95008'}}",1


In [92]:
df[df['OrganizationName'].isnull()][['Filer', '501c3']]

Unnamed: 0,Filer,501c3
1895015,"{'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationC...",1


In [96]:
df.loc[1895015, 'OrganizationName'] = 'PLAY FLAG FOOTBALL'
df.loc[[1895015]]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US,ein_int,F9_00_HD_TIME_STAMP_yr
1895015,PLAY FLAG FOOTBALL,,,201812,204814407,,2018,"{'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationC...",2020-04-17 16:48:07+00:00,,,,,,1,,,1075,0,,,JOHN MORA,2019-11-14,,CA,2018-12-31,2018,2019-11-15 10:29:26-06:00,,1,,,,WWW.PLAYFLAGFOOTBALL.COM,2006,0,0,0,0,0,0,32567,8092,0,0,0,250,-9303,-7842,,23264,8092,0,0,300,0,250,0,0,3,0,0,0,23264,OPERATING YOUTH SPORTS PROGRAMS FOR THE PUBLIC BENEFIT,10378,0,1075,0,0,32567,10378,0,1075,0,0,0,0,1,0,0,1,0,0,0,1,0,0,,1,0,,0,0,0,1,1,1,0,3,0,0,,,,CA,1,0,0,1,0,0,0,0,0,0,,,,,,,,,,,,,,1075,,,,1075,1075,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10378,0,1729,8649,,6630,4245,215130,196111,,1,,,,,,,,,23264,,,,-9303,,0,,1,,,,0,0,1,CA,204814407,2019
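
Rather than typing the name by hand, the missing value could be pulled from the `Filer` record. This is a minimal sketch, assuming `Filer` cells are Python dicts shaped like the two rows displayed in this notebook; `name_from_filer` is a hypothetical helper, not part of the original workflow.

```python
def name_from_filer(filer):
    """Return the business name held in a Filer dict, or None if absent."""
    if not isinstance(filer, dict):
        return None
    # In the rows shown, the 2010 filing uses 'Name' and the 2018 filing 'BusinessName'.
    block = filer.get('BusinessName') or filer.get('Name')
    if not isinstance(block, dict):
        return None
    # Only the name-line keys visible in the displayed rows are checked.
    keys = ('BusinessNameLine1Txt', 'BusinessNameLine1', 'BusinessNameLine2')
    parts = [block[k] for k in keys if block.get(k)]
    return ' '.join(parts) if parts else None

new_style = {'EIN': '204814407',
             'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}}
old_style = {'EIN': '232705170',
             'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-',
                      'BusinessNameLine2': 'PHILADELPHIA REGION INC'}}

print(name_from_filer(new_style))
print(name_from_filer(old_style))
```

Under the same assumption, `df.loc[df['OrganizationName'].isnull(), 'Filer'].apply(name_from_filer)` would list candidate names for every row missing one.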


# Create version with null values filled
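
The fill step this section builds toward can be sketched roughly as below. The function name `fill_null_with_zero` and the toy DataFrame are hypothetical; the notebook's own function may differ.

```python
import numpy as np
import pandas as pd

def fill_null_with_zero(frame, columns):
    """Return a copy of `frame` with nulls in `columns` replaced by 0."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(0)
    return out

# Toy data standing in for the 990 filings (hypothetical values).
toy = pd.DataFrame({
    'name': ['A', None, 'C'],
    'grants': [100.0, np.nan, 300.0],
    'checkbox': [1.0, np.nan, np.nan],
})
no_fill = ['name']  # columns that must keep their nulls

# Columns with at least one null, minus the protected ones.
missing = [c for c in toy.columns[toy.isnull().any()] if c not in no_fill]
filled = fill_null_with_zero(toy, missing)

print(missing)
print(filled.isnull().sum())
```

Working on a copy leaves the original frame available for comparison with the filled version.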

In [None]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [100]:
%%time 
missingcols = list(df.columns[df.isnull().any()])
print(len(missingcols))
print(missingcols)

137
['URL', 'DLN', 'F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_0

In [105]:
print(len(missingcols))
print(len(set(missingcols) - set(no_fill_cols)))
print(len(set(no_fill_cols) - set(missingcols)))

137
123
12


In [106]:
set(no_fill_cols) - set(missingcols)

{'EIN',
 'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_TIME_STAMP_yr',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'Filer',
 'OrganizationName',
 'TaxPeriod',
 'fiscal_year'}

In [107]:
set(missingcols).intersection(set(no_fill_cols))

{'DLN',
 'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_FILER_STATE_US',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_00_HD_YEAR_FORMED',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_12_PC_ACCTG_METHOD_OTHER',
 'URL'}

In [108]:
missingcols = list(set(missingcols) - set(no_fill_cols))
print(len(missingcols))

123


In [109]:
set(missingcols).intersection(set(no_fill_cols))

set()
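
Because a `set` does not preserve order, the refined `missingcols` no longer follows the DataFrame's column order, which is why the slices described below look shuffled. If the original order matters, a list comprehension is one alternative. The sketch uses toy lists, not the real ones.

```python
# Toy stand-ins for missingcols and no_fill_cols.
toy_missing = ['URL', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_WEBSITE', 'F9_01_PC_GRANTS_PRIOR']
toy_no_fill = {'URL', 'F9_00_HD_WEBSITE'}

# Keeps the original column order while dropping the protected columns.
to_fill = [c for c in toy_missing if c not in toy_no_fill]
print(to_fill)
```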

<br>Descriptives for columns not in *missingcols* (complete columns plus the no-fill columns)

In [110]:
df[[c for c in df.columns.tolist() if c not in missingcols]].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_00_HD_EXEMPT_STATUS_501C,486371,7,4,2,6,29
F9_00_HD_GROSS_RCPT,1895016,16053257,519366722,0,580572,310516974055
F9_00_HD_GROUP_RETURN,1895016,0,0,0,0,1
F9_00_HD_TAX_YEAR,1895016,2014,3,2009,2015,2019
F9_00_HD_YEAR_FORMED,1747938,1981,50,1003,1988,9999
F9_01_PC_CONTR_GRANTS_CURR,1895016,1830745,27027125,-654611,114306,9265119609
F9_01_PC_INDEP_VOTING_MEMB,1895016,19,731,0,8,830201
F9_01_PC_PROF_FUNDRISING_EXP_CURR,1895016,4408,117524,-35000,0,32764282
F9_01_PC_REV_LESS_EXP_CURR,1895016,634096,40369426,-2041199044,10894,50245931778
F9_01_PC_TOT_ASSETS_EOY,1895016,22535058,397105363,-98344486,794104,90967341073


<br>Descriptives for columns in *missingcols* (columns with missing data that will be filled)

In [112]:
df[missingcols[:30]].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_09_PC_PAYROLL_TAX_TOTAL,1340159,299407,3041908,-478831,18192,461035788
F9_09_PC_PENSION_CONT_FUNDRAISE,177257,18503,147362,-236355,1389,16919989
F9_01_PC_TOT_EXP_PRIOR,1834415,9892775,155839860,-19865668,466802,53764558598
F9_10_PC_LOANS_FROM_OFFICERS_EOY,101280,151488,3469878,-220250,0,423000000
F9_06_PC_OTHER_WEBSITE,251612,1,0,1,1,1
F9_09_PC_PENSION_CONT_MGMT,340426,119742,1619569,-34665482,6355,319188438
F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,867879,10103,268999,-19443,0,177765602
F9_08_PC_RELATED_ORGANIZATIONS,126705,1608168,32479226,-1475173,52000,9265119609
F9_09_PC_COMP_DISQUAL_FUNDRAISE,40056,15244,86034,0,0,4519359
F9_08_PC_GAMING_GROSS_INCOME,110711,278607,1182695,-3933,18813,65651549


In [113]:
df[missingcols[30:60]].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_01_PZ_SALARIES_PRIOR,1529943,4550316,53239673,-2179118,199826,8438687895
F9_08_PC_MEMBERSHIP_DUES,332392,271181,2617505,-350967,23932,357505219
F9_11_PC_RECNCLTN_REV_LESS_EXP,1861706,631485,40687595,-2041199044,10860,50245931778
F9_12_PC_ACCTG_METHOD_CASH,591946,1,0,1,1,1
F9_09_PC_COMP_OFFICERS_PROG_SVCE,636526,193539,821974,-469889,62151,73445350
F9_09_PC_PAYROLL_TAX_FUNDRAISE,435025,11723,73864,-21772,2106,7543216
F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,1552639,20760,131484,-5231831,5950,35384225
F9_09_PC_TOTAL_MGMT_EXPENSE,1752587,1168146,16020025,-26656443,44938,3559810150
F9_00_HD_EXEMPT_STATUS_4847A1,1514,1,0,1,1,1
F9_10_PC_RET_EARNINGS_ENDWMT_EOY,414016,6744140,234670120,-4218815300,233706,60829733185


In [114]:
df[missingcols[60:90]].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_06_PC_CEO_COMPENSTN_PROCESS,1884238,0,0,0,0,1
F9_09_PC_PENSION_CONT_TOTAL,815157,371380,5257367,-69839423,4000,958578741
F9_09_PC_OTHER_SALARY_FUNDRAISE,425298,148277,1090980,-226274,20490,99085531
F9_06_PC_FORM_UPON_REQUEST,1725127,1,0,1,1,1
F9_10_PC_ORG_NOT_FOLLOW_SFAS117,426692,1,0,1,1,1
F9_12_PC_AUDIT_COMMITTEE,1123853,1,0,0,1,1
F9_09_PC_OTHER_EMP_BEN_TOTAL,1100929,738599,8403664,-21467190,25875,1375814505
F9_00_HD_AMENDED_RETURN,16642,1,0,1,1,1
F9_09_PC_OTHER_SALARY_TOTAL,1356487,4040135,45491824,-396773,167573,6612941799
F9_12_PC_ACCTG_METHOD_ACCRUAL,1263945,1,0,1,1,1


In [115]:
df[missingcols[90:]].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_10_PC_BOND_LIABILITIES_EOY,135949,28233269,160573618,-688803,0,9365288192
F9_10_PC_SAVINGS_TEMP_INVEST_BOY,1108586,2313592,36340376,-101457150,175208,12651079908
F9_08_PC_PROGRAM_SVCE_REV_TOTAL,1417558,10959906,186896297,-131946331,257590,58512193717
F9_09_PC_COMP_DISQUAL_PROG_SVCE,61909,515840,4767445,-59504,21572,352622231
F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,1759321,8569669,155216109,-188083004,329266,55118337884
F9_00_HD_INCLUDES_SUBORD_ORGS,297904,0,0,0,0,1
F9_09_PC_FEES_FOR_SVCE_INVST_TOT,638624,62962,1191338,-2231920,0,231258838
F9_08_PC_GROSS_SALES_INVENTORY,272555,708935,8533560,-626702,42429,1277677982
F9_06_PC_OWN_WEBSITE,117868,1,0,1,1,1
F9_10_PC_SECURED_MORTGAGES_EOY,472996,4813626,75829871,-5139991,366140,15858200062


In [117]:
df[missingcols[:50]].dtypes

F9_09_PC_PAYROLL_TAX_TOTAL            float64
F9_09_PC_PENSION_CONT_FUNDRAISE       float64
F9_01_PC_TOT_EXP_PRIOR                float64
F9_10_PC_LOANS_FROM_OFFICERS_EOY      float64
F9_06_PC_OTHER_WEBSITE                float64
F9_09_PC_PENSION_CONT_MGMT            float64
F9_01_PC_PROF_FUNDRISING_EXP_PRIOR    float64
F9_08_PC_RELATED_ORGANIZATIONS        float64
F9_09_PC_COMP_DISQUAL_FUNDRAISE       float64
F9_08_PC_GAMING_GROSS_INCOME          float64
F9_01_PC_INVEST_INCOME_PRIOR          float64
F9_09_PC_TOTAL_FUNDRAISE_EXPENSE      float64
F9_10_PC_CASH_NON_INTEREST_BOY        float64
F9_08_PC_CONTS_REPRTD_FNDRAISNG       float64
F9_09_PC_COMP_DISQUAL_MGMT            float64
F9_01_PC_TOT_INDIV_VOLUNTEERS         float64
F9_07_PC_NUM_INDS_GREATER_100K        float64
F9_10_PC_CASH_NON_INTEREST_EOY        float64
F9_01_PC_CONTR_GRANTS_PRIOR           float64
F9_07_PC_TOT_REPRT_COMP_RLTD_ORG      float64
F9_08_PC_COST_OF_GOODS_SOLD           float64
F9_07_PC_TOT_OTHER_COMPENSATION   

In [118]:
df[missingcols[50:100]].dtypes

F9_06_PC_MINUTES_COMMITTEES          float64
F9_01_PC_TOT_REVENUE_PRIOR           float64
F9_01_PC_REV_LESS_EXP_PRIOR          float64
F9_01_PC_BEN_PAID_MEMB_PRIOR         float64
F9_01_PC_NET_ASSETS_BOY              float64
F9_06_PC_MONITORING_OF_COI_POLICY    float64
F9_12_PC_FED_GRNT_AUDIT_PERFORMD     float64
F9_09_PC_OTHER_EMP_BEN_FUNDRAISE     float64
F9_06_PC_POLICIES_GOVERN_CHAPTER     float64
F9_09_PC_FEES_FOR_SVCE_MGMT_TOT      float64
F9_06_PC_CEO_COMPENSTN_PROCESS       float64
F9_09_PC_PENSION_CONT_TOTAL          float64
F9_09_PC_OTHER_SALARY_FUNDRAISE      float64
F9_06_PC_FORM_UPON_REQUEST           float64
F9_10_PC_ORG_NOT_FOLLOW_SFAS117      float64
F9_12_PC_AUDIT_COMMITTEE             float64
F9_09_PC_OTHER_EMP_BEN_TOTAL         float64
F9_00_HD_AMENDED_RETURN                Int64
F9_09_PC_OTHER_SALARY_TOTAL          float64
F9_12_PC_ACCTG_METHOD_ACCRUAL        float64
F9_00_HD_ADDR_CHANGE                   Int64
F9_08_PC_GOVERNMENT_GRANTS           float64
F9_10_PC_O

In [119]:
df[missingcols[100:]].dtypes

F9_09_PC_OTHER_SALARY_PROG_SVCE     float64
F9_00_HD_TYPE_ORG_OTHER             float64
F9_09_PC_COMP_OFFICERS_MGMT         float64
F9_10_PC_UNSECURED_NOTES_EOY        float64
F9_01_PC_OTHER_EXPENSE_PRIOR        float64
F9_00_HD_FINAL_RETURN               float64
F9_00_HD_EXEMPT_STATUS_501C3        float64
F9_08_PC_TOTAL_PROG_SVCE_REVENUE    float64
F9_11_PC_RECNCLTN_DONATED_SVCES     float64
F9_08_PC_TOTAL_CONTRIBUTIONS        float64
F9_09_PC_FEES_FOR_SVCE_OTH_TOT      float64
F9_00_HD_TYPE_ORG_ASSOCIATION       float64
F9_01_PZ_TOT_ASSETS_BOY             float64
F9_08_PC_NONCASH_CONTRIBUTIONS      float64
F9_09_PC_PAYROLL_TAX_MGMT           float64
F9_09_PC_COMP_DISQUAL_TOTAL         float64
F9_00_HD_TYPE_ORG_TRUST             float64
F9_01_PZ_TOT_LIAB_BOY               float64
F9_09_PC_FEES_FOR_SVCE_LOBB_TOT     float64
F9_09_PC_OTHER_EMP_BEN_MGMT         float64
F9_06_PC_OTHER_COMPENSTN_PROCESS    float64
F9_07_PC_NO_LISTED_PERS_COMPENSD    float64
F9_07_PC_TOT_REPRT_COMP_FROM_ORG

### Write function to fill missing values and then loop over the variables in *missingcols*

In [120]:
def fillnull(var):
    """Fill null values in column `var` of the global `df` with 0 and report counts."""
    # Count nulls before filling
    print('# of missing observations in %s before processing:' % var, df[var].isnull().sum())

    # Replace nulls with 0; non-null values are left unchanged
    df[var] = np.where(df[var].isnull(), 0, df[var])

    # Count nulls again to confirm the fill worked
    print('# of missing observations in %s after processing:' % var, df[var].isnull().sum(), '\n')

    # Return a small random sample for spot-checking
    return df.sample(5)[['URL', var]]
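The function above works on one column of the global `df` at a time. As a hedged alternative, pandas can fill several columns in one call with `DataFrame.fillna`. The sketch below uses a small made-up DataFrame and hypothetical column names (not the real filings data), and only illustrates the idea of zero-filling an approved subset of columns while leaving the others untouched:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; column names are hypothetical
demo = pd.DataFrame({
    'URL': ['a', None, 'c'],
    'REV_PRIOR': [100.0, np.nan, 250.0],
    'AMENDED': [np.nan, 1.0, np.nan],
})
fill_cols = ['REV_PRIOR', 'AMENDED']  # columns approved for zero-filling

# Count nulls per column before filling
before = demo[fill_cols].isnull().sum()

# Fill only the approved columns; 'URL' keeps its null
demo[fill_cols] = demo[fill_cols].fillna(0)

# Count nulls again and show both counts side by side
after = demo[fill_cols].isnull().sum()
print(pd.DataFrame({'before': before, 'after': after}))
```

This version does not print a line per column, so it is less chatty on a DataFrame with 123 columns to fill, but the before/after table still gives the same verification.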

In [122]:
%%time
for c in missingcols:
    fillnull(c)

# of missing observations in F9_09_PC_PAYROLL_TAX_TOTAL before processing: 554857
# of missing observations in F9_09_PC_PAYROLL_TAX_TOTAL after processing: 0 

# of missing observations in F9_09_PC_PENSION_CONT_FUNDRAISE before processing: 1717759
# of missing observations in F9_09_PC_PENSION_CONT_FUNDRAISE after processing: 0 

# of missing observations in F9_01_PC_TOT_EXP_PRIOR before processing: 60601
# of missing observations in F9_01_PC_TOT_EXP_PRIOR after processing: 0 

# of missing observations in F9_10_PC_LOANS_FROM_OFFICERS_EOY before processing: 1793736
# of missing observations in F9_10_PC_LOANS_FROM_OFFICERS_EOY after processing: 0 

# of missing observations in F9_06_PC_OTHER_WEBSITE before processing: 1643404
# of missing observations in F9_06_PC_OTHER_WEBSITE after processing: 0 

# of missing observations in F9_09_PC_PENSION_CONT_MGMT before processing: 1554590
# of missing observations in F9_09_PC_PENSION_CONT_MGMT after processing: 0 

# of missing observations in F9

# of missing observations in F9_06_PC_MINUTES_COMMITTEES after processing: 0 

# of missing observations in F9_01_PC_TOT_REVENUE_PRIOR before processing: 56877
# of missing observations in F9_01_PC_TOT_REVENUE_PRIOR after processing: 0 

# of missing observations in F9_01_PC_REV_LESS_EXP_PRIOR before processing: 58939
# of missing observations in F9_01_PC_REV_LESS_EXP_PRIOR after processing: 0 

# of missing observations in F9_01_PC_BEN_PAID_MEMB_PRIOR before processing: 1013019
# of missing observations in F9_01_PC_BEN_PAID_MEMB_PRIOR after processing: 0 

# of missing observations in F9_01_PC_NET_ASSETS_BOY before processing: 23822
# of missing observations in F9_01_PC_NET_ASSETS_BOY after processing: 0 

# of missing observations in F9_06_PC_MONITORING_OF_COI_POLICY before processing: 569681
# of missing observations in F9_06_PC_MONITORING_OF_COI_POLICY after processing: 0 

# of missing observations in F9_12_PC_FED_GRNT_AUDIT_PERFORMD before processing: 1660528
# of missing observa

# of missing observations in F9_00_HD_TYPE_ORG_OTHER before processing: 1849070
# of missing observations in F9_00_HD_TYPE_ORG_OTHER after processing: 0 

# of missing observations in F9_09_PC_COMP_OFFICERS_MGMT before processing: 1214041
# of missing observations in F9_09_PC_COMP_OFFICERS_MGMT after processing: 0 

# of missing observations in F9_10_PC_UNSECURED_NOTES_EOY before processing: 1711580
# of missing observations in F9_10_PC_UNSECURED_NOTES_EOY after processing: 0 

# of missing observations in F9_01_PC_OTHER_EXPENSE_PRIOR before processing: 66557
# of missing observations in F9_01_PC_OTHER_EXPENSE_PRIOR after processing: 0 

# of missing observations in F9_00_HD_FINAL_RETURN before processing: 1884893
# of missing observations in F9_00_HD_FINAL_RETURN after processing: 0 

# of missing observations in F9_00_HD_EXEMPT_STATUS_501C3 before processing: 487885
# of missing observations in F9_00_HD_EXEMPT_STATUS_501C3 after processing: 0 

# of missing observations in F9_08_PC_T

#### Check columns that still have null values

In [123]:
list(df.columns[df.isnull().any()])

['URL',
 'DLN',
 'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_00_HD_YEAR_FORMED',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_12_PC_ACCTG_METHOD_OTHER',
 'F9_00_HD_FILER_STATE_US']
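The 14 columns listed above are the same 14 that appeared in both `missingcols` and `no_fill_cols` earlier, so only the intentionally unfilled columns still contain nulls. A minimal sketch of how that check could be automated, assuming a toy DataFrame and a hypothetical exclusion list `no_fill_demo` in place of the real `df` and `no_fill_cols`:

```python
import numpy as np
import pandas as pd

# Made-up data; 'REV_PRIOR' stands in for a column that was already zero-filled
demo = pd.DataFrame({
    'WEBSITE': ['x.org', None],
    'YEAR_FORMED': [1992.0, np.nan],
    'REV_PRIOR': [0.0, 10.0],
})
no_fill_demo = ['WEBSITE', 'YEAR_FORMED', 'EIN']  # hypothetical exclusion list

# Columns that still contain at least one null
still_null = list(demo.columns[demo.isnull().any()])

# Any column with nulls that is not on the exclusion list is a problem
unexpected = sorted(set(still_null) - set(no_fill_demo))
assert not unexpected, f'Unexpected nulls remain in: {unexpected}'

print(still_null)
```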

#### Save DF

In [124]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types and fillnull).pkl')

Wall time: 1min 11s


In [143]:
#%%time
#df = pd.read_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types and fillnull).pkl')
#print('# of columns:', len(df.columns))
#print('# of observations:', len(df))
#df[:1]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US,ein_int,F9_00_HD_TIME_STAMP_yr
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,0,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '18017'}}",2016-02-24 21:20:13+00:00,1,0,,0,,1,0,,1473903,0,0,0,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09 06:41:09-06:00,0,1,0,,0,,1992,0,1439340,1044925,638637,10,30447,1753405,243131,0,0,0,0,89152,193604,0,2440859,881768,195892,0,0,450430,1075372,0,0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,10,10,0,0,0,0,0,"[PA, NJ, DE]",0,0,0,1,0,0,0,0,0,0,1439340,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1439340,1000,0,1473903,0,0,0,0,0,0,0,0,21675,0,215,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1384751,195892,145115,1043744,0,0,0,256845,86228,0,1,0,240077,0,332660,270700,0,0,0,2440859,0,0,0,89152,0,0,1,0,,1,0,0,1,1,PA,232705170,2011


#### Fix *F9_00_HD_SPECIAL_CONDITION_DESC*

In [None]:
df['F9_00_HD_SPECIAL_CONDITION_DESC__SAFE'] = df['F9_00_HD_SPECIAL_CONDITION_DESC']

In [144]:
for index, row in df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][:5].iterrows():
    print(type(row['F9_00_HD_SPECIAL_CONDITION_DESC']), row['F9_00_HD_SPECIAL_CONDITION_DESC'])

<class 'str'> EXTENSION GRANTED TO 11152011
<class 'str'> EXTENSION GRANTED TO 21511
<class 'str'> EXTENDED TO FEBRUARY 15 2011
<class 'list'> ['WITH AN EXPLANATORY STATEMENT.', 'THE TAXPAYER FILES AND REPORTS ITS ACTIVITIES ON', 'THE FEDERAL EXTENSIONS FILED ON A CALENDAR YEAR', 'PROTECTIVE FEDERAL EXTENSIONS WERE FILED ON A', 'FISCAL YEAR 6-30-2010 EXTENSIONS ARE INCLUDED', 'E-FILING: ACCOUNTING PERIODS & FEDERAL EXTENSIONS', 'DID NOT AUTHORIZE A CHANGE IN ACCOUNTING PERIOD.', 'CALENDAR YEAR BASIS BUT THE BOARD OF DIRECTORS', 'BASIS ARE REVOKED AND RESCINDED AND THE', 'A JUNE 30 FISCAL YEAR BASIS.']
<class 'list'> ['YEAR JUNE 30, PERIOD. THE TAXPAYER FILED VALID', 'TIMELY EXTENSIONS ON A FISCAL YEAR BASIS.', 'THE TAXPAYER REPORTS ITS ACTIVTIES ON A FISCAL', 'THE TAXPAYER HAD FILED PROTECTIVE EXTENSIONS', 'THE BOARD DID NOT AUTHORIZE THE CHANGE IN', 'TAX RETURN.', 'SEE ATTACHED MEMORANDUM INCLUDED WITH THE', 'RULES REQUIRED UNDER GAAP AND GAAS.', 'ON A CALENDAR YEAR BASIS BECAUSE THE 

In [145]:
# Join list-valued cells into a single string; leave strings and nulls unchanged
df['F9_00_HD_SPECIAL_CONDITION_DESC'] = df['F9_00_HD_SPECIAL_CONDITION_DESC'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

In [146]:
df['F9_00_HD_SPECIAL_CONDITION_DESC'].value_counts().head()

PUBLIC DISCLOSURE COPY         26
HURRICANE IRMA                 18
EXTENSION GRANTED TO 111519    17
HURRICANE SANDY                12
EXTENDED TO 11152019           12
Name: F9_00_HD_SPECIAL_CONDITION_DESC, dtype: int64

In [151]:
for index, row in df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][:5].iterrows():
    print(type(row['F9_00_HD_SPECIAL_CONDITION_DESC']), row['F9_00_HD_SPECIAL_CONDITION_DESC'])

<class 'str'> EXTENSION GRANTED TO 11152011
<class 'str'> EXTENSION GRANTED TO 21511
<class 'str'> EXTENDED TO FEBRUARY 15 2011
<class 'str'> WITH AN EXPLANATORY STATEMENT. THE TAXPAYER FILES AND REPORTS ITS ACTIVITIES ON THE FEDERAL EXTENSIONS FILED ON A CALENDAR YEAR PROTECTIVE FEDERAL EXTENSIONS WERE FILED ON A FISCAL YEAR 6-30-2010 EXTENSIONS ARE INCLUDED E-FILING: ACCOUNTING PERIODS & FEDERAL EXTENSIONS DID NOT AUTHORIZE A CHANGE IN ACCOUNTING PERIOD. CALENDAR YEAR BASIS BUT THE BOARD OF DIRECTORS BASIS ARE REVOKED AND RESCINDED AND THE A JUNE 30 FISCAL YEAR BASIS.
<class 'str'> YEAR JUNE 30, PERIOD. THE TAXPAYER FILED VALID TIMELY EXTENSIONS ON A FISCAL YEAR BASIS. THE TAXPAYER REPORTS ITS ACTIVTIES ON A FISCAL THE TAXPAYER HAD FILED PROTECTIVE EXTENSIONS THE BOARD DID NOT AUTHORIZE THE CHANGE IN TAX RETURN. SEE ATTACHED MEMORANDUM INCLUDED WITH THE RULES REQUIRED UNDER GAAP AND GAAS. ON A CALENDAR YEAR BASIS BECAUSE THE BOARD OF ITS CALEDNAR YEAR EXTENSIONS. EXTENSIONS AND HAS A
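The first five non-null rows are now all strings. To confirm the same for the whole column rather than a five-row sample, one option is to count the Python types of every value. The sketch below assumes a small made-up Series (not the real column) that mixes a string, a list of lines, and a null:

```python
import numpy as np
import pandas as pd

# Made-up values mimicking the three cases seen in the column
s = pd.Series([
    'EXTENDED TO FEBRUARY 15 2011',
    ['FIRST LINE.', 'SECOND LINE.'],
    np.nan,
])

# Same transformation as in the notebook: join lists, pass everything else through
joined = s.apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

# Tally the types of the non-null values; after the fix no 'list' should remain
type_counts = joined.dropna().map(lambda x: type(x).__name__).value_counts()
print(type_counts)
```

Null values are dropped before counting because they are stored as floats (`NaN`) and would otherwise show up as a separate type.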

In [152]:
# Drop the temporary backup column now that the fix has been verified
df = df.drop(columns='F9_00_HD_SPECIAL_CONDITION_DESC__SAFE')

#### Save DF

In [163]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types and fillnull).pkl')

Wall time: 1min 13s
