#### Update of this notebook:
- *IRS Form 990 e-File Data (6b) -- Fill in Missing Values.ipynb*

# Overview

*Main purpose of notebook:* I created versions of the data with the null values filled with zeros. 

- Read in *concordance_VERIFIED.xlsx* in order to access the *fill_null* column
    - Collapse to *new_variables_df* then use that DF
    - Note that data verifications are done at the beginning of this notebook; specifically, I looked at descriptives for all variables to see which ones had null values that can be filled with zeros. For most if not all of the 'excluded' variables (such as date variables and 501c 'type' variables), it is an obvious decision. 
    - Based on the analyses, I then saved a new column in *concordance_VERIFIED.xlsx* called 'fill_null' (column was filled out in Excel)

- Read in DF: 
    - *all filings may 2021 - all control variables (with parsed sub-key variables and reformatted types).pkl.gz*

- Create numeric version of EIN (*ein_int*)

- Identify columns with missing data:
    - missing_cols = list(df.columns[df.isnull().any()])
    - The above list is then refined to exclude columns where *fill_null* = 'Do not fill null'
    - Write a function to fill null values, then loop over *missing_cols* and apply it to each column

- Fixed one variable:
    - *F9_00_HD_SPECIAL_CONDITION_DESC* had each line as a different list, so I combined them into one text block.
    
- Fixed *OrganizationName* for one row

- File saved *with* null values filled:
    - *all filings may 2021 - all control variables (with parsed sub-key variables and reformatted types and fillnull).pkl.gz*
    

Notes:
- I no longer fix *problem_cols* (e.g., *F9_00_HD_EXEMPT_STATUS_501C*, *F9_12_PC_ACCTG_METHOD_OTHER*) in this notebook
- I also no longer change any data types here


To Do:
- I limit to 501c3s in a later notebook
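
The fill-null steps listed in the overview can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual function: the helper name `fill_null_with_zero` and the sample values are assumptions, and the toy `new_variables_df` stands in for the collapsed concordance.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filings DataFrame (hypothetical values).
df = pd.DataFrame({
    'F9_00_HD_ADDR_CHANGE': pd.array([1, pd.NA, pd.NA], dtype='Int64'),
    'F9_08_PC_MEMBERSHIP_DUES': [1000.0, np.nan, 250.0],
    'F9_00_HD_CTRY_OF_DOMICILE': ['US', None, 'CA'],
    'EIN': ['346526754', '232705170', '123456789'],
})

# Toy stand-in for the collapsed concordance (one row per variable).
new_variables_df = pd.DataFrame({
    'variable_name_new': ['F9_00_HD_ADDR_CHANGE',
                          'F9_08_PC_MEMBERSHIP_DUES',
                          'F9_00_HD_CTRY_OF_DOMICILE'],
    'fill_null': [np.nan, np.nan, 'Do not fill null'],
})

# Step 1: columns that contain at least one null value.
missing_cols = list(df.columns[df.isnull().any()])

# Step 2: drop columns flagged 'Do not fill null' in the concordance.
do_not_fill = set(
    new_variables_df.loc[
        new_variables_df['fill_null'] == 'Do not fill null',
        'variable_name_new',
    ]
)
missing_cols = [c for c in missing_cols if c not in do_not_fill]


# Step 3: hypothetical helper that fills nulls in one column with zero.
def fill_null_with_zero(frame, col):
    frame[col] = frame[col].fillna(0)


for col in missing_cols:
    fill_null_with_zero(df, col)

print(missing_cols)
print(df)
```

Columns flagged 'Do not fill null' (here the country column) keep their missing values, while the checkbox and dollar-amount columns end up with zeros.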

# Load Packages

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

2.2.2


In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

In [4]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

In [61]:
import datetime
import gc

#### Set working directory

In [5]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read 990 DB into PANDAS DF
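Read all filings into a pandas DataFrame. The commented-out cell below is the earlier pickle-based read, kept with its output for reference; the active cell after it reads the April 2025 Feather file.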

In [6]:
#%%time
#import datetime
#print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#df = pd.read_pickle('all NEW filings February 2024 - all control variables (with parsed sub-key variables and reformatted types).pkl.gz',
#            compression='gzip')
#print('# of columns:', len(df.columns))
#print('# of observations:', len(df))
#df[:1]

Current date and time :  2024-03-30 13:58:11 

# of columns: 298
# of observations: 891980
CPU times: total: 25.8 s
Wall time: 30.1 s


Unnamed: 0,URL,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_00_HD_BUILD_TIME_STAMP,fiscal_year,EIN,BusinessName,BusinessNameControlTxt,PhoneNum,USAddress,InCareOfNm,ForeignAddress,ForeignPhoneNum,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_BEGIN,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_03_PC_PGMSVC_SIGNIF_CHG,F9_03_PC_PGMSVC_SIGNIF_NEW,F9_03_PC_PROG_SVC_ACC_1_CODE,F9_03_PC_PROG_SVC_ACC
_1_DESC,F9_03_PC_PROG_SVC_ACC_1_EXP,F9_03_PC_PROG_SVC_ACC_1_GRNT,F9_03_PC_PROG_SVC_ACC_1_REV,F9_03_PC_PROG_SVC_ACC_2_CODE,F9_03_PC_PROG_SVC_ACC_2_DESC,F9_03_PC_PROG_SVC_ACC_2_EXP,F9_03_PC_PROG_SVC_ACC_2_GRNT,F9_03_PC_PROG_SVC_ACC_2_REV,F9_03_PC_PROG_SVC_ACC_3_CODE,F9_03_PC_PROG_SVC_ACC_3_DESC,F9_03_PC_PROG_SVC_ACC_3_EXP,F9_03_PC_PROG_SVC_ACC_3_GRNT,F9_03_PC_PROG_SVC_ACC_3_REV,F9_03_PC_TOT_OTH_PROG_SVC_EXP,F9_03_PC_TOT_OTH_PROG_SVC_GRNT,F9_03_PC_TOT_OTH_PROG_SVC_REV,F9_03_PC_TOT_PROG_SVC_EXPENSE,F9_03_PZ_MISSION_DESCRIPTION,F9_03_PZ_SCHEDULE_O_PART3,F9_04_PC_ACTVITIES_VIA_PARTNER,F9_04_PC_CONTROLLED_ENTITY,F9_04_PC_DISREGARDED_ENTITY,F9_04_PC_EXCESS_BENEFIT_TRANS,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_LOBBYING_ACTIVITIES,F9_04_PC_POLITICAL_ACTIVITIES,F9_04_PC_PRIOR_EXCESS_BEN_TRAN,F9_04_PC_PROF_FR_EXP_GT_15K,F9_04_PC_RELATED_ENTITY,F9_04_PC_TRANS_TO_CNTRLD_ENT,F9_04_PC_TRANS_WITH_CNTRLD_ENT,F9_05_EXP_SCHED_O_X,F9_05_PC_NUMBER_EMPLOYEES_W3,F9_05_PC_NUMBER_FORMS_1096,F9_05_PC_UNRELATED_BUS_INCOME,F9_06_EXP_SCHED_O_X,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_EXP_SCHED_O_X,F9_07_PC_COMPE
NSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_EXP_SCHED_O_X,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_EXP_AD_PROMO_TOT,F9_09_EXP_BENF_PAID_MEMB_TOT,F9_09_EXP_CONF_MEETING_TOT,F9_09_EXP_DEPREC_FUNDR,F9_09_EXP_DEPREC_MAG,F9_09_EXP_DEPREC_PROG,F9_09_EXP_DEPREC_TOT,F9_09_EXP_GRANT_FRGN_TOT,F9_09_EXP_GRANT_INDIV_DMSTC_TOT,F9_09_EXP_GRANT_ORG_DMSTC_TOT,F9_09_EXP_INFO_TECH_TOT,F9_09_EXP_INSURANCE_TOT,F9_09_EXP_INTEREST_TOT,F9_09_EXP_JOINT_COSTS_TOT,F9_09_EXP_OCCUPANCY_TOT,F9_09_EXP_OFFICE_TOT,F9_09_EXP_OTH_OTH_TOT,F9_09_EXP_ROY_TOT,F9_09_EXP_SCHED_O_X,F9_09_EXP_TRAVEL_ENTRTNMNT_TOT,F9_09_EXP_TRAVEL_TOT,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SA
LARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYMENT_TO_AFFILIATES,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_ASSETS_ACC_NET_EOY,F9_10_ASSETS_EXP_PREPAID_EOY,F9_10_ASSETS_INTANGIB_EOY,F9_10_ASSETS_INVENT_SALE_EOY,F9_10_ASSETS_LESS_DEPREC_EOY,F9_10_ASSETS_LOANS_DISQUAL_EOY,F9_10_ASSETS_NOTES_LOANS_NET_EOY,F9_10_ASSETS_OTH_EOY,F9_10_ASSETS_PLEDGES_NET_EOY,F9_10_LIAB_ACC_PAYABLE_EOY,F9_10_LIAB_GRANTS_PAYABLE_EOY,F9_10_LIAB_LOANS_OFF_EOY,F9_10_LIAB_REV_DEFERRED_EOY,F9_10_NAFB_RESTRICT_PERM_EOY,F9_10_NAFB_RESTRICT_TEMP_EOY,F9_10_NAFB_UNRESTRICT_EOY,F9_10_PC_BOND_LIABILITY_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_ESCROW_LIABILITY_EOY,F9_10_PC_INVEST_OTHER_SEC_EOY,F9_10_PC_INVEST_PROG_RELTD_EOY,F9_10_PC_INVEST_PUB_TRADED_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_SECURE_MORT_NOTES_EOY,F9_10_PC_UNSECURED_LOANS_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_10_SCHED_O_X,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_11_SCHED_O_X,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,F9_12_SCHED_O_X,number_of_ot
her_prog_svces,501c3,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN,F9_00_HD_FILER_STATE_US,F9_00_HD_TIME_STAMP_yr
0,https://s3.amazonaws.com/irs-form-990/201812509349300101_public.xml,,2022-09-23 18:48:47+00:00,2018,346526754,{'BusinessNameLine1Txt': 'Lucas County Farm Bureau'},LUCA,4198338015,"{'AddressLine1Txt': '109 Portage St', 'CityNm': 'Woodville', 'StateAbbreviationCd': 'OH', 'ZIPCd': '43469'}",,,,,,,,5.0,,,,272756,0,,,KAYLA RICHARDS,2018-08-29,,OH,2017-08-01,2018-07-31,2017,2018-09-07 04:44:38-07:00,,1.0,,,,,1916.0,239263.0,236036,278582.0,,10,15945.0,383254.0,94080.0,29098.0,0,,,3331,-30456.0,,424855,354081.0,0,7,39.0,38270,323625.0,0,,10,181041,0,10719,386585,IMPROVE RURAL STANDARD OF LIVING.,65279,26001,0,23105,20738.0,424800.0,269425,41546.0,272756,0.0,0.0,,BENEFITS PAID TO OR FOR MEMBERS - THIS IS PAID MEMBERSHIPS TO OHIO FARM BUREAU AND TO AMERICAN FARM BUREAU TO FUTHER THEIR EFFORTS IN PROGRAMMING AND PROMOTING THE FARMING COMMUNITY.,,,,,MEMBERSHIP - COSTS OF PROMOTING FARM BUREAU AND ITS MISSION. PROMOTION OF FARM BUREAU PROGRAMS AND EVENTS IN ORDER TO EDUCATE THE FARMER AND CONSUMER IN CURRENT FARMING AND FOOD ISSUES.,,,,,"CONFERENCE, CONVENTIONS AND MEETINGS - EDUCATION OF VOLUNTEERS FOR THE PROMOTING AND MARKETING OF FARM ISSUES AND CURRENT EVENTS.",,,,,,,,Improve rural standard of living.,,0,0,0,0.0,0,0,0.0,0,0.0,0,0,,0.0,,7,0,0,1.0,1,1.0,1.0,0,1,0,0,0,1,0,1,,1.0,0,,0,0,1,1.0,1,1.0,10,10,0,1.0,,,,OH,1,,0,0,,,,0,,2130.0,,,,,,,,,,,,,,236036.0,,,,236036.0,26001.0,,272756,,181041.0,7018.0,,6770.0,,6770.0,,,,,697.0,,,3478.0,2283.0,42350,,,,331.0,,,,,,,1860.0,1860.0,2352.0,,,,,,,,,,,,21245.0,21245.0,,,,,,,,,,269425,,17563.0,251862.0,6278.0,733.0,,1197.0,109784.0,,,,,11076.0,,,27194.0,,,,,74474.0,72904.0,,,,233959.0,165116.0,55332.0,,,1.0,,386585.0,,,,,,,,424855,,,,,3331,,,0,1.0,,,,,0.0,0,,,0,109 Portage St,,Woodville,43469,,OH,2018


In [6]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_feather('D:/all_filings_april_2025_all_controls_combined_parsed_type.feather')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

Current date and time :  2025-04-18 21:49:35 

# of columns: 306
# of observations: 3469008
CPU times: total: 1min 39s
Wall time: 1min 2s


Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_00_HD_BUILD_TIME_STAMP,fiscal_year,EIN,Name,NameControl,Phone,USAddress,ForeignAddress,InCareOfName,BusinessName,BusinessNameControlTxt,PhoneNum,InCareOfNm,ForeignPhoneNum,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_BEGIN,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_03_PC_PGMSVC_SIGNIF_CHG,F9_03_
PC_PGMSVC_SIGNIF_NEW,F9_03_PC_PROG_SVC_ACC_1_CODE,F9_03_PC_PROG_SVC_ACC_1_DESC,F9_03_PC_PROG_SVC_ACC_1_EXP,F9_03_PC_PROG_SVC_ACC_1_GRNT,F9_03_PC_PROG_SVC_ACC_1_REV,F9_03_PC_PROG_SVC_ACC_2_CODE,F9_03_PC_PROG_SVC_ACC_2_DESC,F9_03_PC_PROG_SVC_ACC_2_EXP,F9_03_PC_PROG_SVC_ACC_2_GRNT,F9_03_PC_PROG_SVC_ACC_2_REV,F9_03_PC_PROG_SVC_ACC_3_CODE,F9_03_PC_PROG_SVC_ACC_3_DESC,F9_03_PC_PROG_SVC_ACC_3_EXP,F9_03_PC_PROG_SVC_ACC_3_GRNT,F9_03_PC_PROG_SVC_ACC_3_REV,F9_03_PC_TOT_OTH_PROG_SVC_EXP,F9_03_PC_TOT_OTH_PROG_SVC_GRNT,F9_03_PC_TOT_OTH_PROG_SVC_REV,F9_03_PC_TOT_PROG_SVC_EXPENSE,F9_03_PZ_MISSION_DESCRIPTION,F9_03_PZ_SCHEDULE_O_PART3,F9_04_PC_ACTVITIES_VIA_PARTNER,F9_04_PC_CONTROLLED_ENTITY,F9_04_PC_DISREGARDED_ENTITY,F9_04_PC_EXCESS_BENEFIT_TRANS,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_LOBBYING_ACTIVITIES,F9_04_PC_POLITICAL_ACTIVITIES,F9_04_PC_PRIOR_EXCESS_BEN_TRAN,F9_04_PC_PROF_FR_EXP_GT_15K,F9_04_PC_RELATED_ENTITY,F9_04_PC_TRANS_TO_CNTRLD_ENT,F9_04_PC_TRANS_WITH_CNTRLD_ENT,F9_05_EXP_SCHED_O_X,F9_05_PC_NUMBER_EMPLOYEES_W3,F9_05_PC_NUMBER_FORMS_1096,F9_05_PC_UNRELATED_BUS_INCOME,F9_06_EXP_SCHED_O_X,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET
_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_EXP_SCHED_O_X,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_EXP_SCHED_O_X,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_EXP_AD_PROMO_TOT,F9_09_EXP_BENF_PAID_MEMB_TOT,F9_09_EXP_CONF_MEETING_TOT,F9_09_EXP_DEPREC_FUNDR,F9_09_EXP_DEPREC_MAG,F9_09_EXP_DEPREC_PROG,F9_09_EXP_DEPREC_TOT,F9_09_EXP_GRANT_FRGN_TOT,F9_09_EXP_GRANT_INDIV_DMSTC_TOT,F9_09_EXP_GRANT_ORG_DMSTC_TOT,F9_09_EXP_INFO_TECH_TOT,F9_09_EXP_INSURANCE_TOT,F9_09_EXP_INTEREST_TOT,F9_09_EXP_JOINT_COSTS_TOT,F9_09_EXP_OCCUPANCY_TOT,F9_09_EXP_OFFICE_TOT,F9_09_EXP_OTH_OTH_TOT,F9_09_EXP_ROY_TOT,F9_09_EXP_SCHED_O_X,F9_09_EXP_TRAVEL_ENTRTNMNT_TOT,F9_09_EXP_TRAVEL_TOT,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09
_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYMENT_TO_AFFILIATES,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_ASSETS_ACC_NET_EOY,F9_10_ASSETS_EXP_PREPAID_EOY,F9_10_ASSETS_INTANGIB_EOY,F9_10_ASSETS_INVENT_SALE_EOY,F9_10_ASSETS_LESS_DEPREC_EOY,F9_10_ASSETS_LOANS_DISQUAL_EOY,F9_10_ASSETS_NOTES_LOANS_NET_EOY,F9_10_ASSETS_OTH_EOY,F9_10_ASSETS_PLEDGES_NET_EOY,F9_10_LIAB_ACC_PAYABLE_EOY,F9_10_LIAB_GRANTS_PAYABLE_EOY,F9_10_LIAB_LOANS_OFF_EOY,F9_10_LIAB_REV_DEFERRED_EOY,F9_10_NAFB_RESTRICT_PERM_EOY,F9_10_NAFB_RESTRICT_TEMP_EOY,F9_10_NAFB_UNRESTRICT_EOY,F9_10_PC_BOND_LIABILITY_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_ESCROW_LIABILITY_EOY,F9_10_PC_INVEST_OTHER_SEC_EOY,F9_10_PC_INVEST_PROG_RELTD_EOY,F9_10_PC_INVEST_PUB_TRADED_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_SECURE_MORT_NOTES_EOY,F9_10_PC_UNSECURED_LOANS_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_10_SCHED_O_X,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_11_SCHED_O_X,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_A
UDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,F9_12_SCHED_O_X,number_of_other_prog_svces,501c3,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN,F9_00_HD_FILER_STATE_US,F9_00_HD_TIME_STAMP_yr
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,,2016-02-24 21:20:13+00:00,,232705170,"{'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}",RONA,8565826843,"{'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'AddressLine1Txt': None, 'AddressLine2': None, 'AddressLine2Txt': None, 'City': 'BETHLEHEM', 'CityNm': None, 'State': 'PA', 'StateAbbreviationCd': None, 'ZIPCd': None, 'ZIPCode': '18017'}",,,,,,,,1,0,,0,,1,0,,1473903,0,0,0,MICHAEL ANTON,2011-11-04,,PA,2010-01-01,2010-12-31,2010,2011-11-09 12:41:09+00:00,0,1,0,,0,,1992,0,1439340,1044925,638637,10,30447,1753405,243131,0,0,0,0,89152,193604,0,2440859,881768,195892,0,0,450430,1075372,0,0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,,"RMHC OF THE PHILADELPHIA REGION, INC. GRANTS HUNDREDS OF THOUSANDS OF DOLLARS PER YEAR TO SUPPORT NON-PROFIT PROGRAMS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN. LOCALLY, RMHC SUPPORTS THE PHILADELPHIA, SOUTHERN NEW JERSEY AND DE...",1043744,925000,,,,,,,,,,,,,,,1043744,"THE CORPORATION IS ORGANIZED AND WILL BE OPERATED EXCLUSIVELY FOR CHARITABLE, EDUCATIONAL AND SCIENTIFIC PURPOSES WITHIN THE MEANING OF SECTION 501(C)(3) OF THE INTERNAL REVENUE CODE. 
SUCH PURPOSES SHALL BE LIMITED TO PROVIDING SUPPORT AND FUNDIN...",1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,10,10,0,0,0,0,0,"[""PA"", ""NJ"", ""DE""]",0,0,0,0,1,0,0,0,0,0,0,0,1439340,,,,,,,,,,,,,,,1439340,1000,,1473903,,,,,86228,,86228,,33000,892000,,,,,,123,763,,0,,,,,,,,,,,21675,,215,,,,,,,,,,,,118744,,,,,,,,,1384751,195892,145115,1043744,147981,,,,170617,,,,,44353,166000,,,,,1990429,,,,,1851561,,,256845,86228,,1,0,240077,,332660,270700,,,,,,2440859,0,,,,89152,,1,0,1,0,,1,0,0,1,1,,1,1525 VALLEY CENTER PARKWAY NO 300,,BETHLEHEM,18017,,PA,2011


In [8]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3469008 entries, 0 to 3469007
Data columns (total 306 columns):
 #    Column                              Dtype              
---   ------                              -----              
 0    _id                                 object             
 1    OrganizationName                    object             
 2    URL                                 object             
 3    DLN                                 object             
 4    TaxPeriod                           object             
 5    F9_09_PC_FEES_FOR_SVCE_FR_TOT       Int64              
 6    F9_00_HD_BUILD_TIME_STAMP           datetime64[ns, UTC]
 7    fiscal_year                         object             
 8    EIN                                 object             
 9    Name                                object             
 10   NameControl                         object             
 11   Phone                               object             
 12   USAddress   

#### Create numeric version of EIN

In [9]:
%%time
df['ein_int'] = df['EIN'].astype('int')
print(len(df[df['ein_int'].isnull()]))

0
CPU times: total: 641 ms
Wall time: 648 ms
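
A note on the check above: as far as I can tell, `astype('int')` raises an error if any EIN is missing or non-numeric, so the null count printed afterwards will be 0 whenever the cell succeeds. If a future data pull contains malformed EINs, a more forgiving variant might look like the sketch below. The sample values are hypothetical, and using `pd.to_numeric` with the nullable `Int64` dtype is an alternative I am suggesting, not what the notebook does.

```python
import pandas as pd

# Hypothetical EIN strings, including one malformed and one missing value.
ein = pd.Series(['346526754', '232705170', '12-3456789', None])

# Coerce anything that is not a clean number to NaN instead of raising.
ein_num = pd.to_numeric(ein, errors='coerce')

# Rows that failed to convert, for manual inspection.
bad = ein[ein_num.isna()]
print('# of EINs that could not be converted:', len(bad))

# Nullable integer dtype keeps the failed rows as <NA>.
ein_int = ein_num.astype('Int64')
print(ein_int)
```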


### Inspect Data to See Which Columns can be filled
This step can be skipped on later runs; use the updated *concordance* file instead to identify columns that should be excluded.
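
One pattern worth looking for in the descriptives is a checkbox-type column: fewer non-null values than rows, with a minimum and maximum of 1. In such a column a null plausibly means "box not checked", so zero is a natural fill. The sketch below flags these columns on toy data. The heuristic and the sample values are my own illustration; the actual decisions were recorded by hand in the *fill_null* column of the concordance.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking three kinds of columns.
df = pd.DataFrame({
    'F9_00_HD_ADDR_CHANGE': [1.0, np.nan, np.nan, 1.0],         # checkbox: 1 or null
    'F9_00_HD_GROUP_RETURN': [0.0, 1.0, 0.0, 0.0],              # fully populated 0/1
    'F9_00_HD_EXEMPT_STATUS_501C': [5.0, np.nan, 7.0, np.nan],  # 501c 'type' code
})

# One row of descriptives per variable.
desc = df.describe().T
n_rows = len(df)

# Checkbox-like: some nulls, and every non-null value equals 1.
checkbox_like = desc[
    (desc['count'] < n_rows) & (desc['min'] == 1) & (desc['max'] == 1)
].index.tolist()
print(checkbox_like)
```

A column such as *F9_00_HD_EXEMPT_STATUS_501C* also has nulls but is not flagged, because its values are type codes and a zero would not be meaningful there.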

In [14]:
#pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [15]:
#%%time 
#df[df.columns.tolist()[:35]].describe(percentiles=[]).T

Wall time: 2.29 s


Unnamed: 0,count,mean,std,min,50%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,460920,19137,244617,-35000,0,32764282
F9_00_HD_ADDR_CHANGE,79364,1,0,1,1,1
F9_00_HD_AMENDED_RETURN,17904,1,0,1,1,1
F9_00_HD_EXEMPT_STATUS_4847A1,1530,1,0,1,1,1
F9_00_HD_EXEMPT_STATUS_501C,497060,7,4,2,6,29
F9_00_HD_EXEMPT_STATUS_501C3,1518034,1,0,1,1,1
F9_00_HD_FINAL_RETURN,10818,1,0,1,1,1
F9_00_HD_GROSS_RCPT,2016624,15925354,506110199,0,578622,310516974055
F9_00_HD_GROUP_RETURN,2016624,0,0,0,0,1
F9_00_HD_INCLUDES_SUBORD_ORGS,304368,0,0,0,0,1


In [16]:
#exclude_cols = ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_YEAR_FORMED']

In [17]:
#%%time 
#df[df.columns.tolist()[35:60]].describe(percentiles=[]).T

Wall time: 2.42 s


Unnamed: 0,count,mean,std,min,50%,max
F9_01_PC_BEN_PAID_MEMB_PRIOR,935423,1552650,47717976,-189170543,0,8893847473
F9_01_PC_CONTR_GRANTS_CURR,2016624,1836699,26891329,-654611,117422,9265119609
F9_01_PC_CONTR_GRANTS_PRIOR,1743979,2014775,26973683,-900000,161932,9265119609
F9_01_PC_GRANTS_PRIOR,1146578,1137292,17438956,-341071,0,4429165079
F9_01_PC_INDEP_VOTING_MEMB,2016624,19,710,0,8,830201
F9_01_PC_INVEST_INCOME_PRIOR,1693032,470824,16235451,-5148733602,1112,9166085042
F9_01_PC_NET_ASSETS_BOY,1990668,11454726,229753689,-5034822702,515972,61288153211
F9_01_PC_OTHER_EXPENSE_PRIOR,1945422,4669423,115248726,-186053238,238375,50467127024
F9_01_PC_OTHER_REV_PRIOR,1514063,317373,5824588,-651010000,9829,1723980625
F9_01_PC_PROF_FUNDRISING_EXP_CURR,2016624,4390,117420,-35000,0,32764282


In [44]:
#%%time 
#df[df.columns.tolist()[60:90]].describe(percentiles=[]).T

Wall time: 2.43 s


Unnamed: 0,count,mean,std,min,50%,max
F9_01_PZ_BEN_PAID_TO_MEMB_CURR,1895016,783262,34334848,-189170543,0,8893847473
F9_01_PZ_GRANTS_PAID_CURR,1895016,686255,13904960,-341071,0,4798368744
F9_01_PZ_INVEST_INCOME_CURR,1895016,457764,15664382,-802196000,446,9166085042
F9_01_PZ_NAFB_EOY,1895016,12041141,244661399,-5034822702,529860,61288153211
F9_01_PZ_OTHER_EXPENSE_CURR,1895016,4759825,123210724,-1530406504,239436,54619014197
F9_01_PZ_OTHER_REV_CURR,1895016,251709,5305318,-123236329,1186,1723980625
F9_01_PZ_PROG_SERVICE_REV_CURR,1895016,8198511,161715891,-131946331,96777,58512193717
F9_01_PZ_SALARIES_CURR,1895016,3870871,50569654,-10284019,117947,9300950001
F9_01_PZ_SALARIES_PRIOR,1529943,4550316,53239673,-2179118,199826,8438687895
F9_01_PZ_TOT_ASSETS_BOY,1873181,21544872,377655439,-98344486,777896,90967341073


In [18]:
#mgt_outsourcing_cols = ['F9_06_PC_CHANGES_ORGANIZING_DOCS', 'F9_06_PC_DELEGATION_MGT_DUTIES', 
#                        'F9_06_PC_DELEGATION_OF_MGT']

In [19]:
#%%time 
#df[df.columns.tolist()[90:120]].describe(percentiles=[]).T

Wall time: 2.61 s


Unnamed: 0,count,mean,std,min,50%,max
F9_06_PC_FORM_AVAIL_OWN_WEBSITE,126888,1,0,1,1,1
F9_06_PC_FORM_UPON_REQUEST,1834046,1,0,1,1,1
F9_06_PC_JOINT_VENTURE_INVESTMNT,2016624,0,0,0,0,1
F9_06_PC_JOINT_VENTURE_POLICY,103422,0,0,0,0,1
F9_06_PC_LOCAL_CHAPTERS,2016624,0,0,0,0,1
F9_06_PC_MATERIAL_DIVERSION,2016624,0,0,0,0,1
F9_06_PC_MEMBERS_OR_STOCKHOLDERS,2016624,0,0,0,0,1
F9_06_PC_MINUTES_COMMITTEES,2009884,1,0,0,1,1
F9_06_PC_MINUTES_GOVERNING_BODY,2016624,1,0,0,1,1
F9_06_PC_MONITORING_OF_COI_POLICY,1408508,1,0,0,1,1


In [20]:
#%%time 
#df[df.columns.tolist()[120:150]].describe().T

Wall time: 2.98 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_08_PC_COST_OF_GOODS_SOLD,275463,427331,5526798,-9864931,2222,24381,125620,754102909
F9_08_PC_FEDERATED_CAMPAIGNS,136633,171843,1713053,-351994,0,16855,81294,127014579
F9_08_PC_FUNDRAISING_DIRECT_EXP,555491,77544,454528,-116677,6027,21551,62470,51988787
F9_08_PC_FUNDRAISING_EVENTS,358923,225267,2957325,-112579,8332,37798,131226,487001193
F9_08_PC_FUNDRAISING_GROSS_INC,577026,101225,458897,-435270,10350,35114,94985,51988787
F9_08_PC_GAMING_DIRECT_EXPENSES,114027,240232,946935,-2096,0,5599,109892,38353740
F9_08_PC_GAMING_GROSS_INCOME,119634,271672,1167930,-3933,0,16132,150605,65651549
F9_08_PC_GOVERNMENT_GRANTS,561940,2572427,28810937,-386453,41782,198100,883303,7239962734
F9_08_PC_GROSS_SALES_INVENTORY,290730,697331,8357139,-626702,3953,39802,209760,1277677982
F9_08_PC_MEMBERSHIP_DUES,351241,266223,2570873,-350967,3385,23320,126030,357505219


In [21]:
#%%time 
#df[df.columns.tolist()[150:180]].describe().T

Wall time: 3.32 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_09_PC_FEES_FOR_SVCE_OTH_TOT,1199579,1055558,17785822,-86875965,175,12867,99506,4597888329
F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,353837,27916,222277,-1756914,448,3240,13222,19568750
F9_09_PC_OTHER_EMP_BEN_MGMT,662722,185024,3178921,-18045287,2763,11418,49902,1223889692
F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,754465,930219,9055350,-21546453,11089,45937,238109,1363191407
F9_09_PC_OTHER_EMP_BEN_TOTAL,1167653,743259,8458915,-21467190,595,25807,150501,1375814505
F9_09_PC_OTHER_SALARY_FUNDRAISE,456718,148277,1091520,-226274,2934,20572,82237,99085531
F9_09_PC_OTHER_SALARY_MGMT,868698,779671,7738614,-12764001,15806,52515,206040,956452259
F9_09_PC_OTHER_SALARY_PROG_SVCE,1055691,4694436,48022529,-2018148,64163,219543,1024768,6256732477
F9_09_PC_OTHER_SALARY_TOTAL,1440869,4079603,46152613,-396773,37925,168336,812085,6612941799
F9_09_PC_PAYROLL_TAX_FUNDRAISE,467625,11724,73750,-21772,416,2113,7430,7543216


In [22]:
#%%time 
#df[df.columns.tolist()[180:]].describe().T

Wall time: 1.99 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_10_PC_RET_EARNINGS_ENDWMT_EOY,436810,6538802,228769189,-4218815300,41186,232986,871005,60829733185
F9_10_PC_SAVINGS_TEMP_INVEST_BOY,1172942,2299952,35660556,-101457150,43507,174826,626400,12651079908
F9_10_PC_SAVINGS_TEMP_INVEST_EOY,1314325,2131916,35899928,-253206305,18871,132272,531813,12651079908
F9_10_PC_SECURED_MORTGAGES_EOY,501390,4781785,74455719,-5139991,42422,365068,1718636,15858200062
F9_10_PC_UNSECURED_NOTES_BOY,193986,2646852,82284521,-2928622,0,2760,99726,10946624716
F9_10_PC_UNSECURED_NOTES_EOY,196956,2707583,77789113,-3154821,0,3500,100000,10064580236
F9_10_PZ_TOTAL_ASSETS_EOY,2016624,22455448,393951605,-98344486,213678,792839,3383613,90967341073
F9_11_PC_RECNCLTN_DONATED_SVCES,55975,43267,1125266,-73279162,0,0,2005,89834062
F9_11_PC_RECNCLTN_INVSTMNT_EXP,44651,-2010,303901,-16902760,0,0,0,39996125
F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,153580,-35569,15319665,-4440535144,-4325,0,2775,2526615691


In [8]:
#exclude_cols = exclude_cols +['']
#print(exclude_cols)

### Read in *concordance* file to see which columns should not be filled

In [10]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:1]

# of columns: 17
# of observations: 574


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key,cardinality
0,/Return/ReturnData/IRS990/SpecialConditionDesc,F9_00_HD_SPECIAL_CONDITION_DESC,,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDesc,,,


In [10]:
#def agg_funcs(x):
#    names = {
#        'data_type_xsd': x['data_type_xsd'].head(1).values[0],
#        'python_data_type': x['python_data_type'].head(1).values[0],
#        'fill_null': x['fill_null'].head(1).values[0],       
#    }
#    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS
#    #return pd.Series(names, index = list(names.keys()))
#    return pd.Series(names, index=['data_type_xsd', 'python_data_type', 'fill_null'])
#new_variables_df = concordance.groupby(['variable_name_new']).apply(agg_funcs)
#new_variables_df = new_variables_df.reset_index()
#print('# of variables:', len(new_variables_df))
#new_variables_df[:]

# of variables: 288




Unnamed: 0,variable_name_new,data_type_xsd,python_data_type,fill_null
0,F9_00_HD_ADDR_CHANGE,CheckboxType,Int64,
1,F9_00_HD_AMENDED_RETURN,CheckboxType,Int64,
2,F9_00_HD_BUILD_TIME_STAMP,TimestampType,DateTime,Do not fill null
3,F9_00_HD_CTRY_OF_DOMICILE,CountryType,string,Do not fill null
4,F9_00_HD_EXEMPT_STATUS_4847A1,CheckboxType,Int64,
...,...,...,...,...
283,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,BooleanType,Int64,
284,F9_12_PC_FINCL_STMTS_AUDITED,BooleanType,Int64,
285,F9_12_SCHED_O_X,CheckboxType,Int64,
286,TaxPeriod,YearMonthType,string,Do not fill null


In [12]:
%%time
new_variables_df = (
    concordance
    .groupby('variable_name_new')
    .agg(
        data_type_xsd=('data_type_xsd', 'first'),
        python_data_type=('python_data_type', 'first'),
        fill_null=('fill_null', 'first'),
    )
    .reset_index()
)

print('# of variables:', len(new_variables_df))
new_variables_df

# of variables: 288
CPU times: total: 0 ns
Wall time: 9.47 ms


Unnamed: 0,variable_name_new,data_type_xsd,python_data_type,fill_null
0,F9_00_HD_ADDR_CHANGE,CheckboxType,Int64,
1,F9_00_HD_AMENDED_RETURN,CheckboxType,Int64,
2,F9_00_HD_BUILD_TIME_STAMP,TimestampType,DateTime,Do not fill null
3,F9_00_HD_CTRY_OF_DOMICILE,CountryType,string,Do not fill null
4,F9_00_HD_EXEMPT_STATUS_4847A1,CheckboxType,Int64,
...,...,...,...,...
283,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,BooleanType,Int64,
284,F9_12_PC_FINCL_STMTS_AUDITED,BooleanType,Int64,
285,F9_12_SCHED_O_X,CheckboxType,Int64,
286,TaxPeriod,YearMonthType,string,Do not fill null
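
One behavioral difference between the two collapsing approaches is worth keeping in mind: the `'first'` aggregation returns the first *non-null* value in each group, while taking the first row of each group (as the commented-out `agg_funcs` did with `.head(1)`) keeps a null if the first row is blank. The sketch below illustrates this on a toy frame; the rows are invented for illustration and are not taken from *concordance_VERIFIED.xlsx*.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concordance (hypothetical rows): variable "A" appears
# on two rows, and only the second row carries a fill_null flag.
toy = pd.DataFrame({
    'variable_name_new': ['A', 'A', 'B'],
    'fill_null': [np.nan, 'Do not fill null', np.nan],
})

# Named aggregation with 'first' picks the first NON-null value per group.
collapsed = (
    toy.groupby('variable_name_new')
       .agg(fill_null=('fill_null', 'first'))
       .reset_index()
)
print(collapsed)

# Taking the first row of each group keeps the null for "A" instead.
first_rows = toy.groupby('variable_name_new').head(1)
print(first_rows)
```

If a variable's flag is only filled in on some of its concordance rows, the two approaches can therefore disagree, so it may be worth confirming that the flag is consistent within each variable.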


In [13]:
new_variables_df['fill_null'].value_counts()

fill_null
Do not fill null    32
Name: count, dtype: int64
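
Note that `value_counts()` drops null values by default, which is why only the 32 flagged variables appear above. A minimal sketch, on an invented toy Series, of how to include the unflagged (null) entries in the tally:

```python
import numpy as np
import pandas as pd

# Toy flags (hypothetical values): two flagged variables, three left blank.
flags = pd.Series(
    ['Do not fill null', np.nan, np.nan, 'Do not fill null', np.nan],
    name='fill_null',
)

# dropna=False adds a row for the null entries to the frequency table.
counts = flags.value_counts(dropna=False)
print(counts)

# The same number can be read off directly.
n_blank = flags.isna().sum()
print('# of blank flags:', n_blank)
```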

In [14]:
new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()

['F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_FILER_ADDR_US_L1',
 'F9_00_HD_FILER_ADDR_US_L2',
 'F9_00_HD_FILER_CITY_US',
 'F9_00_HD_FILER_COUNTRY_FRGN',
 'F9_00_HD_FILER_STATE_US',
 'F9_00_HD_FILER_ZIP_US',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TAX_PER_BEGIN',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_00_HD_YEAR_FORMED',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'F9_03_PC_PROG_SVC_ACC_1_CODE',
 'F9_03_PC_PROG_SVC_ACC_1_DESC',
 'F9_03_PC_PROG_SVC_ACC_2_CODE',
 'F9_03_PC_PROG_SVC_ACC_2_DESC',
 'F9_03_PC_PROG_SVC_ACC_3_CODE',
 'F9_03_PC_PROG_SVC_ACC_3_DESC',
 'F9_03_PZ_MISSION_DESCRIPTION',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_12_PC_ACCTG_METHOD_OTHER',
 'TaxPeriod']

In [15]:
print(len(set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist())))

32


In [16]:
string_cols = df.select_dtypes(include='object').columns.tolist()
print(len(string_cols))
print(string_cols, '\n')
#df[string_cols].describe().T

44
['_id', 'OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'fiscal_year', 'EIN', 'Name', 'NameControl', 'Phone', 'USAddress', 'ForeignAddress', 'InCareOfName', 'BusinessName', 'BusinessNameControlTxt', 'PhoneNum', 'InCareOfNm', 'ForeignPhoneNum', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_BEGIN', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_03_PC_PROG_SVC_ACC_1_DESC', 'F9_03_PC_PROG_SVC_ACC_2_DESC', 'F9_03_PC_PROG_SVC_ACC_3_DESC', 'F9_03_PZ_MISSION_DESCRIPTION', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_09_EXP_OTH_OTH_TOT', 'F9_12_PC_ACCTG_METHOD_OTHER', 'number_of_other_prog_svces', 'F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'F9_00_HD_FILER_CITY_US', 'F9_00_HD_FILER_ZIP_US', 'F9_00_HD_FILER_COUNTRY_FRGN', 'F9_00_HD_FILER_STATE_US', 'F9_00_HD_TIME_ST

In [17]:
set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()) - set(string_cols)

{'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_YEAR_FORMED',
 'F9_03_PC_PROG_SVC_ACC_1_CODE',
 'F9_03_PC_PROG_SVC_ACC_2_CODE',
 'F9_03_PC_PROG_SVC_ACC_3_CODE'}

In [18]:
set(string_cols) - set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist())

{'BusinessName',
 'BusinessNameControlTxt',
 'DLN',
 'EIN',
 'F9_00_HD_TIME_STAMP_yr',
 'F9_09_EXP_OTH_OTH_TOT',
 'ForeignAddress',
 'ForeignPhoneNum',
 'InCareOfName',
 'InCareOfNm',
 'Name',
 'NameControl',
 'OrganizationName',
 'Phone',
 'PhoneNum',
 'URL',
 'USAddress',
 '_id',
 'fiscal_year',
 'number_of_other_prog_svces'}

In [19]:
no_fill_cols = list(set(string_cols + new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()))
print(len(no_fill_cols))

52


In [20]:
set(string_cols) - set(no_fill_cols)

set()

In [21]:
set(no_fill_cols) - set(string_cols)

{'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_YEAR_FORMED',
 'F9_03_PC_PROG_SVC_ACC_1_CODE',
 'F9_03_PC_PROG_SVC_ACC_2_CODE',
 'F9_03_PC_PROG_SVC_ACC_3_CODE'}

In [22]:
set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist()) - set(no_fill_cols)

set()

In [23]:
set(no_fill_cols) - set(new_variables_df[new_variables_df['fill_null']=='Do not fill null']['variable_name_new'].tolist())

{'BusinessName',
 'BusinessNameControlTxt',
 'DLN',
 'EIN',
 'F9_00_HD_TIME_STAMP_yr',
 'F9_09_EXP_OTH_OTH_TOT',
 'ForeignAddress',
 'ForeignPhoneNum',
 'InCareOfName',
 'InCareOfNm',
 'Name',
 'NameControl',
 'OrganizationName',
 'Phone',
 'PhoneNum',
 'URL',
 'USAddress',
 '_id',
 'fiscal_year',
 'number_of_other_prog_svces'}
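
The pairwise set differences above can be wrapped in a small helper so that each comparison is a single call. This is only a sketch: `compare_lists` is a hypothetical name, and the column names below are a toy subset rather than the real lists.

```python
def compare_lists(left, right):
    """Return the items unique to each list and the items they share."""
    left_set, right_set = set(left), set(right)
    return {
        'only_left': sorted(left_set - right_set),
        'only_right': sorted(right_set - left_set),
        'both': sorted(left_set & right_set),
    }

# Toy inputs standing in for string_cols and the 'Do not fill null' list
string_cols_toy = ['EIN', 'URL', 'TaxPeriod']
flagged_toy = ['TaxPeriod', 'F9_00_HD_TAX_YEAR']

result = compare_lists(string_cols_toy, flagged_toy)
print(result)
```

Sorting the results makes the output stable from run to run, which plain `set` output does not guarantee.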

In [24]:
no_fill_cols

['F9_03_PZ_MISSION_DESCRIPTION',
 'F9_00_HD_FILER_ADDR_US_L2',
 'F9_00_HD_WEBSITE',
 'NameControl',
 'TaxPeriod',
 'InCareOfNm',
 'F9_03_PC_PROG_SVC_ACC_1_DESC',
 'OrganizationName',
 'F9_09_EXP_OTH_OTH_TOT',
 'BusinessNameControlTxt',
 '_id',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_FILER_CITY_US',
 'URL',
 'Name',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_03_PC_PROG_SVC_ACC_3_DESC',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_03_PC_PROG_SVC_ACC_1_CODE',
 'F9_00_HD_FILER_COUNTRY_FRGN',
 'DLN',
 'F9_00_HD_FILER_ADDR_US_L1',
 'ForeignAddress',
 'F9_00_HD_TIME_STAMP_yr',
 'EIN',
 'F9_00_HD_FILER_STATE_US',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
 'F9_03_PC_PROG_SVC_ACC_2_DESC',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_BUILD_TIME_STAMP',
 'Phone',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'InCareOfName',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_TAX_PER_BEGIN',
 'PhoneNum',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_CTRY_OF_DOMICILE',
 'BusinessName',
 'F9_00_HD_STATE_OF_DOMICILE',
 'ForeignPho

In [25]:
no_fill_cols.remove('TaxPeriod')
#THIS VARIABLE WAS float64
no_fill_cols.remove('F9_00_HD_FILER_ADDR_US_L2')

In [26]:
print(len(no_fill_cols))

50


In [25]:
#%%time
#for col in no_fill_cols:
#    print(col, ': ', len(df[df[col].isnull()]))

F9_12_PC_ACCTG_METHOD_OTHER :  871932
F9_00_HD_EXEMPT_STATUS_501C :  665600
F9_03_PC_PROG_SVC_ACC_2_CODE :  891980
USAddress :  1300
F9_03_PC_PROG_SVC_ACC_2_DESC :  631339
F9_00_HD_GROSS_EXEMPT_NUM :  862375
F9_00_HD_SIGNING_OFFICER_SIGNTR :  0
InCareOfNm :  848199
F9_00_HD_FILER_ZIP_US :  1300
F9_00_HD_FILER_COUNTRY_FRGN :  890680
F9_00_HD_CTRY_OF_DOMICILE :  891102
F9_03_PC_PROG_SVC_ACC_3_CODE :  891980
F9_00_HD_YEAR_FORMED :  56277
URL :  0
F9_03_PZ_MISSION_DESCRIPTION :  2421
ForeignAddress :  890680
F9_00_HD_TAX_PER_BEGIN :  0
F9_09_EXP_OTH_OTH_TOT :  332736
F9_00_HD_SPECIAL_CONDITION_DESC :  890838
F9_00_HD_TIME_STAMP :  0
PhoneNum :  107710
F9_01_PZ_ORGANIZATIONAL_MISSION :  0
ForeignPhoneNum :  889365
F9_00_HD_TIME_STAMP_yr :  0
EIN :  0
BusinessNameControlTxt :  0
F9_03_PC_PROG_SVC_ACC_1_DESC :  0
F9_00_HD_WEBSITE :  113292
F9_00_HD_PRIN_OFF_NAME :  118313
fiscal_year :  0
F9_00_HD_BUILD_TIME_STAMP :  0
F9_03_PC_PROG_SVC_ACC_1_CODE :  890474
F9_00_HD_TYPE_ORG_OTHER_DESC :  873

<br>New, faster method: count the nulls in every column with one vectorized call

In [30]:
%%time
df[no_fill_cols].isnull().sum()

CPU times: total: 9 s
Wall time: 9.23 s


F9_03_PZ_MISSION_DESCRIPTION          6892
F9_00_HD_WEBSITE                    393171
NameControl                        2973485
InCareOfNm                         3336073
F9_03_PC_PROG_SVC_ACC_1_DESC             0
OrganizationName                   1276573
F9_09_EXP_OTH_OTH_TOT              1188407
BusinessNameControlTxt              495523
_id                                      0
F9_00_HD_SPECIAL_CONDITION_DESC    3466060
F9_00_HD_FILER_CITY_US                3742
URL                                      0
Name                               2973485
F9_00_HD_TYPE_ORG_OTHER_DESC       3403633
F9_03_PC_PROG_SVC_ACC_3_DESC       2740729
F9_06_PC_STATES_WHERE_RET_FILED    1686392
F9_03_PC_PROG_SVC_ACC_1_CODE       3463424
F9_00_HD_FILER_COUNTRY_FRGN        3465266
DLN                                1276573
F9_00_HD_FILER_ADDR_US_L1             3742
ForeignAddress                     3465266
F9_00_HD_TIME_STAMP_yr                   0
EIN                                      0
F9_00_HD_FI

<br>And this shows how many rows in each column are *not* missing

In [29]:
%%time
df[no_fill_cols].count()

CPU times: total: 8.94 s
Wall time: 9.31 s


F9_03_PZ_MISSION_DESCRIPTION       3462116
F9_00_HD_WEBSITE                   3075837
NameControl                         495523
InCareOfNm                          132935
F9_03_PC_PROG_SVC_ACC_1_DESC       3469008
OrganizationName                   2192435
F9_09_EXP_OTH_OTH_TOT              2280601
BusinessNameControlTxt             2973485
_id                                3469008
F9_00_HD_SPECIAL_CONDITION_DESC       2948
F9_00_HD_FILER_CITY_US             3465266
URL                                3469008
Name                                495523
F9_00_HD_TYPE_ORG_OTHER_DESC         65375
F9_03_PC_PROG_SVC_ACC_3_DESC        728279
F9_06_PC_STATES_WHERE_RET_FILED    1782616
F9_03_PC_PROG_SVC_ACC_1_CODE          5584
F9_00_HD_FILER_COUNTRY_FRGN           3742
DLN                                2192435
F9_00_HD_FILER_ADDR_US_L1          3465266
ForeignAddress                        3742
F9_00_HD_TIME_STAMP_yr             3469008
EIN                                3469008
F9_00_HD_FI
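
The missing and non-missing counts can also be placed side by side in one table, which makes it easy to confirm that they sum to the number of rows. A minimal sketch on an invented toy frame (the EIN and website values are placeholders):

```python
import numpy as np
import pandas as pd

# Toy frame with placeholder values
toy = pd.DataFrame({
    'EIN': ['010000001', '010000002', '010000003'],
    'F9_00_HD_WEBSITE': ['example.org', np.nan, np.nan],
})

# One row per column: nulls, non-nulls, and the share of rows that are null
summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'n_present': toy.count(),
})
summary['share_missing'] = summary['n_missing'] / len(toy)
print(summary)
```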

##### Fix one row
Now done in an earlier notebook

In [23]:
#pd.set_option('max_colwidth', 500)

In [27]:
#df[df['OrganizationName'].isnull()][['EIN', '501c3']]

In [25]:
#df.loc[1895015, 'OrganizationName'] = 'PLAY FLAG FOOTBALL'
#df.loc[[1895015]]

# Create version with null values filled

In [31]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [32]:
%%time 
missingcols = list(df.columns[df.isnull().any()])
print(len(missingcols))
print(missingcols)

187
['OrganizationName', 'DLN', 'TaxPeriod', 'F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'fiscal_year', 'Name', 'NameControl', 'Phone', 'USAddress', 'ForeignAddress', 'InCareOfName', 'BusinessName', 'BusinessNameControlTxt', 'PhoneNum', 'InCareOfNm', 'ForeignPhoneNum', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV_VOLUNTEERS', 'F9_01_PC_TOT_REVENUE_PRIOR', 'F9_01_PC_TOT_UBI_NET', 'F9_01_PZ_SALARIES_PRIOR', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_01_

In [33]:
print(len(no_fill_cols))

50


In [34]:
print(len(missingcols))
print(len(set(missingcols) - set(no_fill_cols)))
print(len(set(no_fill_cols) - set(missingcols)))

187
149
12


In [35]:
set(no_fill_cols) - set(missingcols)

{'EIN',
 'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
 'F9_00_HD_TAX_PER_BEGIN',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_TIME_STAMP',
 'F9_00_HD_TIME_STAMP_yr',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'F9_03_PC_PROG_SVC_ACC_1_DESC',
 'URL',
 '_id'}

In [36]:
set(missingcols).intersection(set(no_fill_cols))

{'BusinessName',
 'BusinessNameControlTxt',
 'DLN',
 'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_FILER_ADDR_US_L1',
 'F9_00_HD_FILER_CITY_US',
 'F9_00_HD_FILER_COUNTRY_FRGN',
 'F9_00_HD_FILER_STATE_US',
 'F9_00_HD_FILER_ZIP_US',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_00_HD_YEAR_FORMED',
 'F9_03_PC_PROG_SVC_ACC_1_CODE',
 'F9_03_PC_PROG_SVC_ACC_2_CODE',
 'F9_03_PC_PROG_SVC_ACC_2_DESC',
 'F9_03_PC_PROG_SVC_ACC_3_CODE',
 'F9_03_PC_PROG_SVC_ACC_3_DESC',
 'F9_03_PZ_MISSION_DESCRIPTION',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_09_EXP_OTH_OTH_TOT',
 'F9_12_PC_ACCTG_METHOD_OTHER',
 'ForeignAddress',
 'ForeignPhoneNum',
 'InCareOfName',
 'InCareOfNm',
 'Name',
 'NameControl',
 'OrganizationName',
 'Phone',
 'PhoneNum',
 'USAddress',
 'fiscal_year',
 'number_of_other_prog_svces'}

In [37]:
missingcols = list(set(missingcols) - set(no_fill_cols))
print(len(missingcols))

149


In [38]:
set(missingcols).intersection(set(no_fill_cols))

set()
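
Because `list(set(...) - set(...))` does not preserve the original column order, the refined list comes back in arbitrary order. If the DataFrame's column order matters for the later descriptives, a list comprehension is one alternative. A sketch with toy names:

```python
# Toy stand-ins for missingcols and no_fill_cols
missing_toy = [
    'OrganizationName',
    'F9_01_PC_GRANTS_PRIOR',
    'TaxPeriod',
    'F9_01_PC_TOT_EXP_PRIOR',
]
no_fill_toy = {'OrganizationName', 'TaxPeriod'}  # a set gives fast lookups

# Keep only the columns to be filled, in their original order
to_fill = [c for c in missing_toy if c not in no_fill_toy]
print(to_fill)
```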

<br>Descriptives for numeric columns that are not missing data and for columns that will not be filled

In [49]:
%%time
df[[c for c in df.columns.tolist() if c not in missingcols]].describe(percentiles=[]).T

CPU times: total: 17.2 s
Wall time: 17.4 s


Unnamed: 0,count,mean,std,min,50%,max
F9_00_HD_ADDR_CHANGE,3469008,0,0,0,0,1
F9_00_HD_AMENDED_RETURN,3469008,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_4847A1,3469008,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_501C,857105,7,4,2,6,29
F9_00_HD_EXEMPT_STATUS_501C3,3469008,1,0,0,1,1
...,...,...,...,...,...,...
F9_12_PC_FED_GRNT_AUDIT_REQUIRED,3469008,0,0,0,0,1
F9_12_PC_FINCL_STMTS_AUDITED,3469008,0,0,0,0,1
F9_12_SCHED_O_X,3469008,0,0,0,0,1
501c3,3469008,1,0,0,1,1


<br>Descriptives for columns missing data

In [50]:
print(len(missingcols))

147


In [None]:
#%%time
#df[missingcols[:30]].describe(percentiles=[]).T

#### Show all columns
If the output is truncated by display limits (e.g., in Jupyter), temporarily change the pandas display options to show every row of the `.describe().T` output.

In [52]:
%%time
with pd.option_context('display.max_rows', None):  # Show all rows in output
    display(df[missingcols].describe(percentiles=[]).T)

Unnamed: 0,count,mean,std,min,50%,max
F9_03_PC_PROG_SVC_ACC_1_GRNT,941878,2174659,33258716,-68392350,49600,7634854046
F9_09_EXP_GRANT_FRGN_TOT,888420,434373,20638785,-639368,0,8362101007
F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,1663329,53004,618102,-9350949,1261,230532317
F9_10_PC_INVEST_PROG_RELTD_EOY,750463,3020107,83183365,-200946565,0,15653084036
F9_09_PC_PENSION_CONT_TOTAL,1432996,357800,5633753,-69839423,3212,1651163807
F9_10_ASSETS_LOANS_DISQUAL_EOY,692432,24388,7781699,-953698,0,4629461948
F9_08_PC_NONCASH_CONTRIBUTIONS,588250,1490569,39428405,-3490596,31200,9855438190
F9_08_PC_FUNDRAISING_GROSS_INC,948700,95284,2794647,-1023893,29710,2690940620
F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,2369278,6065815,81554705,-1477763000,221554,26027576282
F9_03_PC_PROG_SVC_ACC_2_EXP,882293,15437197,11829136144,-31035418,126732,11111111111111


CPU times: total: 37.1 s
Wall time: 38 s


#### Frequencies for data type

In [44]:
%%time
df[missingcols].dtypes.value_counts()

CPU times: total: 2.16 s
Wall time: 2.18 s


Int64     147
object      2
Name: count, dtype: int64

In [45]:
df[missingcols].dtypes[df[missingcols].dtypes == 'object']

TaxPeriod                    object
F9_00_HD_FILER_ADDR_US_L2    object
dtype: object

In [46]:
missingcols.remove('TaxPeriod')
missingcols.remove('F9_00_HD_FILER_ADDR_US_L2')
df[missingcols].dtypes[df[missingcols].dtypes == 'object']

Series([], dtype: object)

In [47]:
%%time
df[missingcols].dtypes.value_counts()

CPU times: total: 2.06 s
Wall time: 2.14 s


Int64    147
Name: count, dtype: int64
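
Rather than removing the non-integer columns by name, the list could be filtered on dtype directly. The sketch below is one way to do that under the assumption that every column to be filled is a nullable `Int64` column; the frame and column names are toy examples.

```python
import pandas as pd

# Toy frame: one nullable-integer column and one text column
toy = pd.DataFrame({
    'amount': pd.array([1, None, 3], dtype='Int64'),
    'TaxPeriod': ['201912', None, '202012'],
})
candidates = ['amount', 'TaxPeriod']

# Keep only the columns whose dtype is the nullable Int64 extension type
int_cols = [c for c in candidates if isinstance(toy[c].dtype, pd.Int64Dtype)]
other_cols = [c for c in candidates if c not in int_cols]

print('Int64 columns:', int_cols)
print('Left out:', other_cols)
```

Printing the columns that were left out keeps the manual review step that the name-based removal provides.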

### Write function to fill missing values and then loop over the variables in *missingcols*

In [48]:
#def fillnull(var):
#    #print(df[var].value_counts().to_frame().head(), '\n')
#    print('# of missing observations in %s before processing:' % var, len(df[df[var].isnull()]))
#    
#    df[var] = np.where(df[var].isnull(), 0, df[var])
#    
#    #print(df[var].value_counts().to_frame().head(), '\n')
#    print('# of missing observations in %s after processing:' % var, len(df[df[var].isnull()]), '\n')
#    return df.sample(5)[['URL', var]]
#    #print(df[[newvar, var1, var2, 'ObjectId']][:5], '\n\n\n')

In [49]:
#%%time
#for c in missingcols[:]:
#   fillnull(c)

# of missing observations in F9_10_PC_OTHER_LIABILITIES_EOY before processing: 524852
# of missing observations in F9_10_PC_OTHER_LIABILITIES_EOY after processing: 0 

# of missing observations in F9_01_PC_TERMINATION_CONTRACTION before processing: 884985
# of missing observations in F9_01_PC_TERMINATION_CONTRACTION after processing: 0 

# of missing observations in F9_09_PC_OTHER_SALARY_TOTAL before processing: 288447
# of missing observations in F9_09_PC_OTHER_SALARY_TOTAL after processing: 0 

# of missing observations in F9_09_PC_PENSION_CONT_PROG_SVCE before processing: 725236
# of missing observations in F9_09_PC_PENSION_CONT_PROG_SVCE after processing: 0 

# of missing observations in F9_01_PC_REV_LESS_EXP_PRIOR before processing: 36192
# of missing observations in F9_01_PC_REV_LESS_EXP_PRIOR after processing: 0 

# of missing observations in F9_09_PC_COMP_OFFICERS_FUNDRAISE before processing: 715338
# of missing observations in F9_09_PC_COMP_OFFICERS_FUNDRAISE after processing:

# of missing observations in F9_10_PC_SECURED_MORTGAGES_EOY before processing: 689359
# of missing observations in F9_10_PC_SECURED_MORTGAGES_EOY after processing: 0 

# of missing observations in F9_00_HD_FILER_ADDR_US_L2 before processing: 891980
# of missing observations in F9_00_HD_FILER_ADDR_US_L2 after processing: 0 

# of missing observations in F9_06_PC_CEO_COMPENSTN_PROCESS before processing: 2903
# of missing observations in F9_06_PC_CEO_COMPENSTN_PROCESS after processing: 0 

# of missing observations in F9_06_PC_POLICIES_GOVERN_CHAPTER before processing: 848486
# of missing observations in F9_06_PC_POLICIES_GOVERN_CHAPTER after processing: 0 

# of missing observations in F9_01_PC_NET_ASSETS_BOY before processing: 18184
# of missing observations in F9_01_PC_NET_ASSETS_BOY after processing: 0 

# of missing observations in F9_00_HD_TYPE_ORG_CORP before processing: 102436
# of missing observations in F9_00_HD_TYPE_ORG_CORP after processing: 0 

# of missing observations in F9

# of missing observations in F9_08_PC_ALL_OTHER_CONTRIBUTIONS before processing: 280564
# of missing observations in F9_08_PC_ALL_OTHER_CONTRIBUTIONS after processing: 0 

# of missing observations in F9_09_EXP_GRANT_INDIV_DMSTC_TOT before processing: 599140
# of missing observations in F9_09_EXP_GRANT_INDIV_DMSTC_TOT after processing: 0 

# of missing observations in F9_01_PZ_TOT_ASSETS_BOY before processing: 17031
# of missing observations in F9_01_PZ_TOT_ASSETS_BOY after processing: 0 

# of missing observations in F9_09_PC_TOTAL_MGMT_EXPENSE before processing: 58680
# of missing observations in F9_09_PC_TOTAL_MGMT_EXPENSE after processing: 0 

# of missing observations in F9_09_PC_TOTAL_FUNDRAISE_EXPENSE before processing: 62288
# of missing observations in F9_09_PC_TOTAL_FUNDRAISE_EXPENSE after processing: 0 

# of missing observations in F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN before processing: 313184
# of missing observations in F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN after processing: 0 

#

# of missing observations in F9_09_EXP_SCHED_O_X after processing: 0 

# of missing observations in F9_09_PC_TOTAL_PROG_SVCE_EXPENSE before processing: 56394
# of missing observations in F9_09_PC_TOTAL_PROG_SVCE_EXPENSE after processing: 0 

# of missing observations in F9_00_HD_INITIAL_RETURN before processing: 876529
# of missing observations in F9_00_HD_INITIAL_RETURN after processing: 0 

# of missing observations in F9_11_PC_RECNCLTN_UNRLZD_GAIN before processing: 632402
# of missing observations in F9_11_PC_RECNCLTN_UNRLZD_GAIN after processing: 0 

# of missing observations in F9_12_PC_AUDIT_COMMITTEE before processing: 472342
# of missing observations in F9_12_PC_AUDIT_COMMITTEE after processing: 0 

# of missing observations in F9_08_PC_NONCASH_CONTRIBUTIONS before processing: 736870
# of missing observations in F9_08_PC_NONCASH_CONTRIBUTIONS after processing: 0 

# of missing observations in F9_09_PC_COMP_OFFICERS_TOTAL before processing: 393229
# of missing observations in F

#### New one-liner
The entire loop above can be replaced with a vectorized one-liner that fills all missing values in the `Int64` columns with 0, which is much faster:

```python
%%time
df[missingcols] = df[missingcols].fillna(0)
```

This maintains the `Int64` (nullable integer) data type and efficiently fills missing values with 0 across all 147 columns.

For a quick before/after check on the total number of missing values across these columns:

In [53]:
%%time                                                                       
before = df[missingcols].isnull().sum().sum()
df[missingcols] = df[missingcols].fillna(0)
after = df[missingcols].isnull().sum().sum()
print(f"# of missing values before: {before}")
print(f"# of missing values after:  {after}")                                         

# of missing values before: 294764070
# of missing values after:  0
CPU times: total: 17.9 s
Wall time: 18.3 s
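
The check above reports a single grand total. If a per-column breakdown is wanted, along with confirmation that the `Int64` dtype survives the fill, something like the following sketch could be used. It runs on a toy frame with invented column names `a` and `b`.

```python
import pandas as pd

# Toy frame with two nullable-integer columns
toy = pd.DataFrame({
    'a': pd.array([1, None, 3], dtype='Int64'),
    'b': pd.array([None, None, 5], dtype='Int64'),
})
cols = ['a', 'b']

before = toy[cols].isnull().sum()   # nulls per column before filling
toy[cols] = toy[cols].fillna(0)
after = toy[cols].isnull().sum()    # nulls per column after filling

report = pd.DataFrame({'before': before, 'after': after})
print(report)
print(toy.dtypes)                   # both columns should still be Int64
```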


In [54]:
%%time
with pd.option_context('display.max_rows', None):  # Show all rows in output
    display(df[missingcols].describe(percentiles=[]).T)

Unnamed: 0,count,mean,std,min,50%,max
F9_03_PC_PROG_SVC_ACC_1_GRNT,3469008,590447,17357023,-68392350,0,7634854046
F9_09_EXP_GRANT_FRGN_TOT,3469008,111244,10446288,-639368,0,8362101007
F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,3469008,25414,428821,-9350949,0,230532317
F9_10_PC_INVEST_PROG_RELTD_EOY,3469008,653351,38709935,-200946565,0,15653084036
F9_09_PC_PENSION_CONT_TOTAL,3469008,147802,3625192,-69839423,0,1651163807
F9_10_ASSETS_LOANS_DISQUAL_EOY,3469008,4868,3476657,-953698,0,4629461948
F9_08_PC_NONCASH_CONTRIBUTIONS,3469008,252760,16245938,-3490596,0,9855438190
F9_08_PC_FUNDRAISING_GROSS_INC,3469008,26058,1462082,-1023893,0,2690940620
F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,3469008,4142856,67458214,-1477763000,38807,26027576282
F9_03_PC_PROG_SVC_ACC_2_EXP,3469008,3926232,5965637891,-31035418,0,11111111111111


CPU times: total: 31.5 s
Wall time: 32.3 s


#### Fix *F9_00_HD_SPECIAL_CONDITION_DESC*
This variable contains *lists* of text. Combine them into one text string.

In [51]:
#df['F9_00_HD_SPECIAL_CONDITION_DESC__SAFE'] = df['F9_00_HD_SPECIAL_CONDITION_DESC']

In [55]:
print(len(df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()]))

2948


In [56]:
for index, row in df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][50:60].iterrows():
    print(type(row['F9_00_HD_SPECIAL_CONDITION_DESC']), row['F9_00_HD_SPECIAL_CONDITION_DESC'])

<class 'str'> ADMEDED TO ADD SCH R PT. IV - CONTR. EMPLYR INFO
<class 'str'> FORM 990, PART XI, LINE 9 NET EXP EL CASCO LLC
<class 'str'> huricane sandy
<class 'str'> EXTENDED TO 2152013
<class 'str'> HURRICANE SANDY DISASTER RELEIF AREA
<class 'str'> MA RETURN IS DUE JAN 15
<class 'str'> HURRICANE SANDY RELIEF IR 201283 DUE DATE 2113
<class 'str'> HURRICANE SANDY
<class 'str'> HURRICANE SANDY
<class 'str'> HURRICANE SANDY


In [57]:
%%time
df['F9_00_HD_SPECIAL_CONDITION_DESC'] = df['F9_00_HD_SPECIAL_CONDITION_DESC'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

CPU times: total: 703 ms
Wall time: 866 ms
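
To make the effect of the join concrete, here is a small sketch on an invented Series. List entries are collapsed into one string, while plain strings and nulls pass through unchanged.

```python
import numpy as np
import pandas as pd

# Toy values: a list of lines, a plain string, and a null
desc = pd.Series(
    [['HURRICANE SANDY', 'DISASTER AREA'], 'COVID19', np.nan],
    dtype='object',
)

# Join only the entries that are lists; leave everything else as-is
joined = desc.apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
print(joined.tolist())
```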


In [58]:
df['F9_00_HD_SPECIAL_CONDITION_DESC'].value_counts().head()

F9_00_HD_SPECIAL_CONDITION_DESC
EXTENDED TO NOVEMBER 15 2024    39
EXTENDED TO NOVEMBER 15 2023    35
COVID19                         32
CHANGE IN ACCOUNTING PERIOD     32
PUBLIC DISCLOSURE COPY          29
Name: count, dtype: int64

In [59]:
for index, row in df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][50:60].iterrows():
    print(type(row['F9_00_HD_SPECIAL_CONDITION_DESC']), row['F9_00_HD_SPECIAL_CONDITION_DESC'])

<class 'str'> ADMEDED TO ADD SCH R PT. IV - CONTR. EMPLYR INFO
<class 'str'> FORM 990, PART XI, LINE 9 NET EXP EL CASCO LLC
<class 'str'> huricane sandy
<class 'str'> EXTENDED TO 2152013
<class 'str'> HURRICANE SANDY DISASTER RELEIF AREA
<class 'str'> MA RETURN IS DUE JAN 15
<class 'str'> HURRICANE SANDY RELIEF IR 201283 DUE DATE 2113
<class 'str'> HURRICANE SANDY
<class 'str'> HURRICANE SANDY
<class 'str'> HURRICANE SANDY


In [66]:
#df = df.drop('F9_00_HD_SPECIAL_CONDITION_DESC__SAFE', axis=1)

#### Save DF

In [60]:
%%time
import datetime  # needed for the timestamps printed in the save cells
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_feather('D:/all_filings_april_2025_all_controls_combined_parsed_type_fillnull.feather')

Current date and time :  2025-04-18 22:44:21 

CPU times: total: 55.8 s
Wall time: 43.8 s


In [63]:
%%time
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_parquet("D:/all_filings_april_2025_all_controls_combined_parsed_type_fillnull.parquet", engine="pyarrow", compression="snappy", index=False)

Current date and time :  2025-04-18 22:46:01 

CPU times: total: 1min 36s
Wall time: 1min 36s


In [64]:
%%time
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('all_filings_april_2025_all_controls_combined_parsed_type_fillnull.pkl.gz', compression='gzip')

Current date and time :  2025-04-19 15:57:31 

CPU times: total: 51min 48s
Wall time: 52min 56s
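
After a long save it can be reassuring to read the file back and confirm that nothing changed. The sketch below shows the idea on a toy frame written to a temporary directory; the file name and values are placeholders, and on the full dataset the read-back would add considerable time.

```python
import os
import tempfile

import pandas as pd

# Toy frame with placeholder values
toy = pd.DataFrame({
    'EIN': ['010000001', '010000002'],
    'amount': pd.array([0, 5], dtype='Int64'),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_fillnull.pkl.gz')  # hypothetical file name
    toy.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

# equals() compares values and dtypes, treating nulls in the same place as equal
round_trip_ok = restored.equals(toy)
print('Round trip OK:', round_trip_ok)
```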


In [61]:
#%%time
#import datetime
#print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#df.to_pickle('all NEW filings February 2024 - all control variables (with parsed sub-key variables and reformatted types and fillnull).pkl.gz',
#            compression='gzip')

Current date and time :  2024-03-31 00:51:07 

CPU times: total: 13min 5s
Wall time: 13min 33s
