In this notebook I parse all of the 'dictionary' columns and then change the data type of the relevant variables to *int* or *float*. I also generate a new variable: *F9_12_PC_ACCTG_METHOD_OTHER__description*

I created versions of the data with the null values filled with zeros. 

Files saved *without* null values filled:
- *all filings - with 185 newly named control variables (with parsed sub-key variables).pkl*
- *all filings - with 185 newly named control variables (with parsed sub-key variables).csv*
- *all filings - with 185 newly named control variables (with parsed sub-key variables).dta*
    - This version of the Stata file excludes 5 problem columns (e.g., columns that contain lists)

Files saved *with* null values filled:
- *all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).pkl*
- *all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).csv*
- *all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).dta*
     - This version of the Stata file contains all variables
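
As a rough illustration of the two kinds of output, the sketch below fills nulls with zeros on a toy frame and flags columns that hold Python lists, which is one reason a column can be a problem for a Stata export. The toy data and the helper `list_columns` are assumptions made for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filings data (hypothetical values, not real filings)
toy = pd.DataFrame({
    'EIN': ['000000001', '000000002'],
    'F9_08_PC_MEMBERSHIP_DUES': [np.nan, 1000.0],
    'F9_06_PC_STATES_WHERE_RET_FILED': [['PA', 'NJ', 'DE'], np.nan],
})

def list_columns(frame):
    """Return the names of columns in which at least one cell is a list."""
    return [c for c in frame.columns
            if frame[c].apply(lambda v: isinstance(v, list)).any()]

problem_cols = list_columns(toy)
stata_ready = toy.drop(columns=problem_cols)   # candidate frame for a .dta export
filled = toy.fillna(0)                         # null-filled version

print(problem_cols)
print(filled)
```

Detecting the list-valued columns programmatically avoids hard-coding which columns to drop if the set of variables changes.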
     
Note:
- The following four variables have names longer than 32 characters (Stata's limit), so Stata truncated them:
    - 'F9_01_PC_PROF_FUNDRISING_EXP_CURR'   ->   F9_01_PC_PROF_FUNDRISING_EXP_CUR
    - 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR'   ->   F9_01_PC_PROF_FUNDRISING_EXP_PRI
    - 'F9_06_PC_MONITORING_OF_COI_POLICY'   ->   F9_06_PC_MONITORING_OF_COI_POLIC
    - 'F9_12_PC_ACCTG_METHOD_OTHER__description'   ->   F9_12_PC_ACCTG_METHOD_OTHER__des
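
One way to anticipate these renames is to list the column names longer than 32 characters before exporting. The sketch below is illustrative only: `find_long_names` is a hypothetical helper, and it assumes the rename is a plain truncation to the first 32 characters, which is consistent with the four cases listed above.

```python
STATA_MAX_NAME_LEN = 32  # maximum variable-name length noted above

def find_long_names(columns, max_len=STATA_MAX_NAME_LEN):
    """Map each over-long column name to its first `max_len` characters."""
    return {c: c[:max_len] for c in columns if len(c) > max_len}

cols = [
    'F9_01_PC_PROF_FUNDRISING_EXP_CURR',
    'F9_06_PC_MONITORING_OF_COI_POLICY',
    'F9_12_PC_ACCTG_METHOD_OTHER__description',
    'F9_00_HD_TAX_YEAR',   # short enough, so it is left alone
]
truncated = find_long_names(cols)
for old, new in truncated.items():
    print(old, '->', new)
```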

# Load Packages

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

1.1.5


In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [4]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read in Concordance File
We are going to read in two codebooks. First, there is the 'concordance' file. Specifically, before re-arranging and renaming variables, we will read in the relevant section from the *master concordance* file, and then use this file to identify the relevant 'compensation' variables. In a following notebook, we will be using the *variable_name_new* field as our variable name.

In [53]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 16
# of observations: 384


Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,,TaxPeriodEndDate,,
1,/Return/ReturnHeader/TaxPeriodEndDt,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,,TaxPeriodEndDt,,


In [12]:
concordance[concordance['sub_key'].notnull()][['variable_name_new', 'MongoDB_Name', 'sub_key']]

Unnamed: 0,variable_name_new,MongoDB_Name,sub_key
5,F9_00_HD_SIGNING_OFFICER_SIGNTR,BusinessOfficerGrp,SignatureDt
6,F9_00_HD_SIGNING_OFFICER_SIGNTR,Officer,DateSigned
9,F9_00_HD_FILER_STATE_US,Filer,USAddress
10,F9_00_HD_FILER_STATE_US,Filer,USAddress
242,F9_09_PC_COMP_OFFICERS_TOTAL,CompCurrentOfcrDirectorsGrp,TotalAmt
...,...,...,...
373,F9_10_PC_CASH_NON_INTEREST_EOY,CashNonInterestBearingGrp,EOYAmt
374,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,SavingsAndTempCashInvestments,BOY
375,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,SavingsAndTempCashInvstGrp,BOYAmt
376,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,SavingsAndTempCashInvestments,EOY


In [13]:
subkeycols = list(set(concordance[concordance['sub_key'].notnull()]['variable_name_new'].tolist()))
print(len(subkeycols))
subkeycols

50


['F9_08_PC_TOTAL_REVENUE',
 'F9_09_PC_COMP_OFFICERS_TOTAL',
 'F9_09_PC_OTHER_SALARY_PROG_SVCE',
 'F9_09_PC_COMP_OFFICERS_MGMT',
 'F9_09_PC_OTHER_EMP_BEN_FUNDRAISE',
 'F9_09_PC_FEES_FOR_SVCE_MGMT_TOT',
 'F9_00_HD_FILER_STATE_US',
 'F9_09_PC_PENSION_CONT_PROG_SVCE',
 'F9_09_PC_OTHER_EMP_BEN_MGMT',
 'F9_09_PC_PAYROLL_TAX_PROG_SVCE',
 'F9_09_PC_PAYROLL_TAX_MGMT',
 'F9_10_PC_LOANS_FROM_OFFICERS_EOY',
 'F9_09_PC_TOTAL_MGMT_EXPENSE',
 'F9_09_PC_TOTAL_FUNC_EXPENSES',
 'F9_10_PC_RET_EARNINGS_ENDWMT_EOY',
 'F9_10_PC_CASH_NON_INTEREST_BOY',
 'F9_09_PC_FEES_FOR_SVCE_INVST_TOT',
 'F9_10_PC_CASH_NON_INTEREST_EOY',
 'F9_10_PC_SAVINGS_TEMP_INVEST_BOY',
 'F9_09_PC_FEES_FOR_SVCE_FR_TOT',
 'F9_10_PC_UNSECURED_NOTES_EOY',
 'F9_09_PC_OTHER_SALARY_TOTAL',
 'F9_09_PC_PAYROLL_TAX_TOTAL',
 'F9_09_PC_OTHER_SALARY_FUNDRAISE',
 'F9_09_PC_PENSION_CONT_FUNDRAISE',
 'F9_09_PC_TOTAL_PROG_SVCE_EXPENSE',
 'F9_10_PC_OTHER_LIABILITIES_EOY',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
 'F9_09_PC_COMP_DISQUAL_FUNDRAISE',
 'F9_09_P

# Read 990 DB into PANDAS DF
Next, we read all filings into a pandas DataFrame from the gzipped pickle file.

In [87]:
%%time
#df = pd.read_pickle('all filings - with 185 newly named control variables.pkl')
df = pd.read_pickle('all filings nov. 2020 - all control variables (renamed).pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

# of columns: 199
# of observations: 1895016
Wall time: 3min 41s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1.0,,,,,1,,,1473903,0,,,MICHAEL ANTON,"{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0.0,1439340,1044925.0,638637.0,10,30447,1753405,243131,0.0,0,0.0,0,89152,193604,,2440859,881768,195892,0,0.0,450430,1075372,0,0.0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1.0,0,0,0,0,0.0,0,1439340.0,,,,,,,,,,,,,,,1439340,1000,,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}",,,,,,,,,"{'Total': '21675', 'ManagementAndGeneral': '21675'}",,"{'Total': '215', 'ManagementAndGeneral': '215'}",,,,,,,,,,,,,,,,,,,,"{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}",,,,256845,86228,,1,,"{'BOY': '51640', 'EOY': '240077'}",,"{'BOY': '332660', 'EOY': '270700'}","{'BOY': '332660', 'EOY': '270700'}",,,,"{'BOY': '1925215', 'EOY': 
'2440859'}",,,,89152,,0,1,,,1,,0,1,1
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,{'Total': '0'},2011,"{'EIN': '581805618', 'Name': {'BusinessNameLine1': 'TORRINGTON VOA ELDERLY HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}, 'NameControl': 'TORR', 'Phone': '7033415000', 'USAddress': {'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA'...",2016-02-24 21:20:13Z,,,,,,1,,1736.0,266420,0,0.0,,,"{'Name': 'THOMAS D TURNBULL', 'Title': 'ASST. SEC/TREAS', 'DateSigned': '2011-11-09'}",,WY,2011-06-30,2010,2011-11-09T07:32:06-08:00,,1,,,,,1993,,0,,,13,1425,1437850,189785,,0,,222839,-39085,-36926,,1433342,261190,0,0,,34577,224264,0,,19,0,0,828,1398765,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,222550,0,265592,82955,71405,1455332,305505,17482,266420,0,0,0,1,1,1,0,1,1,1,1,0,1,1,,1,0,0.0,0,0,0,1,1,1,13,19,0,1,,,0.0,,0,0,1,,0,0,1,411648,,1180355,,,,,,,,,,,,,,265592.0,,0,0,265592.0,"{'TotalRevenueColumn': '266420', 'RelatedOrExemptFunctionIncome': '266420'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '7500', 'ManagementAndGeneral': '7500'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '21600', 'ManagementAndGeneral': '21600'}",{'Total': '0'},"{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '5801', 'ProgramServices': 
'5801'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}",,"{'BOY': '250', 'EOY': '22261'}","{'BOY': '250', 'EOY': '22261'}",2187206,904332,,1,,"{'BOY': '9203', 'EOY': '11349'}",,{'EOY': '0'},{'EOY': '0'},"{'BOY': '6219', 'EOY': '7035'}",,,"{'BOY': '1455332', 'EOY': '1433342'}",,,,-39085,,0,1,,,1,1.0,1,1,1


In [9]:
print(df.columns.tolist())

['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'fiscal_year', 'Filer', 'F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_CURR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INDEP_VOTI

# CODE TO FLATTEN DICTIONARY

# Combine Variables

In [55]:
df[['F9_00_HD_BUILD_TIME_STAMP' ,'F9_00_HD_TIME_STAMP', 'F9_00_HD_TAX_YEAR', 'TaxPeriod']].sample(5)

Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_TIME_STAMP,F9_00_HD_TAX_YEAR,TaxPeriod
1366722,2018-06-14 16:35:46Z,2018-05-04T01:34:39-00:00,2017,201712
230496,2016-02-24 21:20:13Z,2012-07-05T09:48:32-05:00,2011,201112
1843389,2020-04-17 16:48:07Z,2019-11-15T14:49:39-06:00,2018,201812
1752918,2020-03-31 21:24:44Z,2019-11-14T09:07:57-08:00,2018,201812
380890,2016-02-24 21:20:13Z,2013-08-01T09:24:10-00:00,2012,201212


In [88]:
def agg_funcs(x):
    """Collapse the concordance rows for one new variable into a single row."""
    names = {
        #'name': x['variable_name_new'].head(1).values[0],
        'original_names':  list(set(x['MongoDB_Name'].tolist())),  # unique source names
        'sub_keys':  list(set(x['sub_key'].tolist())),              # unique sub-keys
        'data_type_xsd': x['data_type_xsd'].head(1).values[0]       # type from the first row
    }
    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS
    #return pd.Series(names, index = list(names.keys()))
    return pd.Series(names, index=['original_names', 'sub_keys', 'data_type_xsd'])

# Keep only the concordance rows that have a sub_key, then build one row per new variable
new_variables_df = concordance[concordance['sub_key'].notnull()].groupby(['variable_name_new']).apply(agg_funcs)
new_variables_df = new_variables_df.reset_index()
print('# of variables:', len(new_variables_df))
new_variables_df

# of variables: 50


Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd
0,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType
1,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[DateSigned, SignatureDt]",DateType
2,F9_08_PC_TOTAL_REVENUE,"[TotalRevenue, TotalRevenueGrp]","[TotalRevenueColumnAmt, TotalRevenueColumn]",USAmountType
3,F9_09_PC_COMP_DISQUAL_FUNDRAISE,"[CompDisqualPersonsGrp, CompDisqualPersons]","[FundraisingAmt, Fundraising]",USAmountNNType
4,F9_09_PC_COMP_DISQUAL_MGMT,"[CompDisqualPersonsGrp, CompDisqualPersons]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountNNType
5,F9_09_PC_COMP_DISQUAL_PROG_SVCE,"[CompDisqualPersonsGrp, CompDisqualPersons]","[ProgramServicesAmt, ProgramServices]",USAmountNNType
6,F9_09_PC_COMP_DISQUAL_TOTAL,"[CompDisqualPersonsGrp, CompDisqualPersons]","[TotalAmt, Total]",USAmountNNType
7,F9_09_PC_COMP_OFFICERS_FUNDRAISE,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[FundraisingAmt, Fundraising]",USAmountType
8,F9_09_PC_COMP_OFFICERS_MGMT,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountType
9,F9_09_PC_COMP_OFFICERS_PROG_SVCE,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[ProgramServices, ProgramServicesAmt]",USAmountType


In [89]:
new_variables_df['len'] = new_variables_df['original_names'].apply(len)
print(new_variables_df['len'].value_counts(), '\n')
new_variables_df

2    48
1     2
Name: len, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len
0,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,1
1,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[DateSigned, SignatureDt]",DateType,2
2,F9_08_PC_TOTAL_REVENUE,"[TotalRevenue, TotalRevenueGrp]","[TotalRevenueColumnAmt, TotalRevenueColumn]",USAmountType,2
3,F9_09_PC_COMP_DISQUAL_FUNDRAISE,"[CompDisqualPersonsGrp, CompDisqualPersons]","[FundraisingAmt, Fundraising]",USAmountNNType,2
4,F9_09_PC_COMP_DISQUAL_MGMT,"[CompDisqualPersonsGrp, CompDisqualPersons]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountNNType,2
5,F9_09_PC_COMP_DISQUAL_PROG_SVCE,"[CompDisqualPersonsGrp, CompDisqualPersons]","[ProgramServicesAmt, ProgramServices]",USAmountNNType,2
6,F9_09_PC_COMP_DISQUAL_TOTAL,"[CompDisqualPersonsGrp, CompDisqualPersons]","[TotalAmt, Total]",USAmountNNType,2
7,F9_09_PC_COMP_OFFICERS_FUNDRAISE,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[FundraisingAmt, Fundraising]",USAmountType,2
8,F9_09_PC_COMP_OFFICERS_MGMT,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountType,2
9,F9_09_PC_COMP_OFFICERS_PROG_SVCE,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[ProgramServices, ProgramServicesAmt]",USAmountType,2


In [90]:
#for index, row in new_variables_df[new_variables_df['len_subkeys']!=2].iterrows():
#    variable = row['variable_name_new']
#    keys = row['sub_keys']

In [91]:
new_variables_df['len_subkeys'] = new_variables_df['sub_keys'].apply(len)
print(new_variables_df['len_subkeys'].value_counts(), '\n')
new_variables_df

2    48
1     2
Name: len_subkeys, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,len_subkeys
0,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,1,1
1,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[DateSigned, SignatureDt]",DateType,2,2
2,F9_08_PC_TOTAL_REVENUE,"[TotalRevenue, TotalRevenueGrp]","[TotalRevenueColumnAmt, TotalRevenueColumn]",USAmountType,2,2
3,F9_09_PC_COMP_DISQUAL_FUNDRAISE,"[CompDisqualPersonsGrp, CompDisqualPersons]","[FundraisingAmt, Fundraising]",USAmountNNType,2,2
4,F9_09_PC_COMP_DISQUAL_MGMT,"[CompDisqualPersonsGrp, CompDisqualPersons]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountNNType,2,2
5,F9_09_PC_COMP_DISQUAL_PROG_SVCE,"[CompDisqualPersonsGrp, CompDisqualPersons]","[ProgramServicesAmt, ProgramServices]",USAmountNNType,2,2
6,F9_09_PC_COMP_DISQUAL_TOTAL,"[CompDisqualPersonsGrp, CompDisqualPersons]","[TotalAmt, Total]",USAmountNNType,2,2
7,F9_09_PC_COMP_OFFICERS_FUNDRAISE,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[FundraisingAmt, Fundraising]",USAmountType,2,2
8,F9_09_PC_COMP_OFFICERS_MGMT,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountType,2,2
9,F9_09_PC_COMP_OFFICERS_PROG_SVCE,"[CompCurrentOfcrDirectorsGrp, CompCurrentOfficersDirectors]","[ProgramServices, ProgramServicesAmt]",USAmountType,2,2


# Extract Key
We now have five variables copied once each for *TOTAL*, *FUNDRAISING*, *MANAGEMENT*, and *PROGRAM SERVICES*. We can loop over the variable names and sub_keys in *new_variables_df* and extract the desired value from each dictionary.
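
The loop described here might look roughly like the following sketch. The toy frames and the helper `extract_subkey` are assumptions for illustration, not the notebook's own code. The helper returns the value of the first matching sub-key, since the same field appears under different names in different filings (e.g. `Total` vs. `TotalAmt`).

```python
import numpy as np
import pandas as pd

def extract_subkey(cell, keys):
    """Return the value of the first key in `keys` found in a dict cell, else NaN."""
    if isinstance(cell, dict):
        for k in keys:
            if k in cell:
                return cell[k]
    return np.nan

# Toy stand-ins (hypothetical rows) for df and new_variables_df
toy_df = pd.DataFrame({
    'F9_09_PC_COMP_OFFICERS_TOTAL': [
        {'Total': '21675', 'ManagementAndGeneral': '21675'},
        {'TotalAmt': '0'},
        np.nan,
    ],
})
toy_vars = pd.DataFrame({
    'variable_name_new': ['F9_09_PC_COMP_OFFICERS_TOTAL'],
    'sub_keys': [['TotalAmt', 'Total']],
})

# Replace each dictionary column with the value stored under its sub-key
for _, row in toy_vars.iterrows():
    col = row['variable_name_new']
    toy_df[col] = toy_df[col].apply(extract_subkey, keys=row['sub_keys'])

print(toy_df)
```

The extracted values are still strings at this point, so a later conversion to a numeric type would be needed.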

In [59]:
print(len(new_variables_df))
new_variables_df[:1]

50


Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,len_subkeys
0,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,1,1


# Parse variables

In [60]:
df.dtypes

OrganizationName                    object
URL                                 object
DLN                                 object
TaxPeriod                           object
EIN                                 object
                                     ...  
F9_12_PC_AUDIT_COMMITTEE            object
F9_12_PC_FED_GRNT_AUDIT_PERFORMD    object
F9_12_PC_FED_GRNT_AUDIT_REQUIRED    object
F9_12_PC_FINCL_STMTS_AUDITED        object
501c3                                int32
Length: 199, dtype: object

### Extended 'lambda' function
https://stackoverflow.com/questions/48872234/using-apply-in-pandas-lambda-functions-with-multiple-if-statements?noredirect=1&lq=1

##### Define function

In [61]:
df.sample(1)

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3
1469021,MEDICAL EDUCATION FOUNDATION ACADEMY,https://s3.amazonaws.com/irs-form-990/201822499349300402_public.xml,93493249004028,201712,237120447,{'TotalAmt': '0'},2017,"{'EIN': '237120447', 'BusinessName': {'BusinessNameLine1Txt': 'Medical Education Foundation Academy'}, 'BusinessNameControlTxt': 'MEDI', 'PhoneNum': '4192269567', 'USAddress': {'AddressLine1Txt': '730 W Market Street', 'CityNm': 'Lima', 'StateAbb...",2018-06-14 16:35:46Z,,,,,,1,,,100833,0,0,,David Neidhardt MD,"{'PersonNm': 'David Neidhardt MD', 'PersonTitleTxt': 'President', 'SignatureDt': '2018-09-06', 'DiscussWithPaidPreparerInd': 'true'}",,OH,2017-12-31,2017,2018-09-06T07:07:04-07:00,,1,,,,,1971,,1498,6061,,5,19496,762480,8316,,0,,,21111,17241,,807030,8316,0,0,,0,25557,0,,5,0,0,29869,807030,Provide loans to medical students,10256,0,0,0,,762480,10256,,31367,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,1,0,,0,0,0,0,1,0,5,5,0,0,,,,OH,0,0,0,1,0,0,0,,,,1498,,,,,,,,,,,,,0,,1498,0,0,"{'TotalRevenueColumnAmt': '31367', 'ExclusionAmt': '29869'}",{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},"{'TotalAmt': '1435', 'ManagementAndGeneralAmt': '1435'}",{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},"{'TotalAmt': '10256', 'ProgramServicesAmt': '8621', 'ManagementAndGeneralAmt': '1635', 'FundraisingAmt': '0'}","{'TotalAmt': '10256', 'ProgramServicesAmt': '8621', 'ManagementAndGeneralAmt': '1635', 'FundraisingAmt': '0'}","{'TotalAmt': '10256', 'ProgramServicesAmt': '8621', 'ManagementAndGeneralAmt': '1635', 'FundraisingAmt': '0'}","{'TotalAmt': '10256', 'ProgramServicesAmt': '8621', 
'ManagementAndGeneralAmt': '1635', 'FundraisingAmt': '0'}",,"{'BOYAmt': '8956', 'EOYAmt': '4808'}","{'BOYAmt': '8956', 'EOYAmt': '4808'}",10943,10943,,1,,,,{'EOYAmt': '0'},{'EOYAmt': '0'},,,,"{'BOYAmt': '762480', 'EOYAmt': '807030'}",,,,21111,23439,1,,1,,1,,0,0,1


##### Write Functions

In [93]:
def func_onekey(x, key1):
    """Return x[key1] if the cell is a dict containing key1; otherwise NaN."""
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    else:
        return np.nan

In [94]:
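def func(x, key1, key2):
    """Return x[key1] if present, else x[key2] if present, else NaN (also NaN for null x)."""
    if pd.isnull(x):
        return np.nan
    # ast.literal_eval is not needed here: the cells are already dicts, not strings
    elif key1 in x:
        return x[key1]
    elif key2 in x:
        return x[key2]
    else:
        return np.nan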
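
The two lookup helpers above differ only in how many candidate key names they try. As a rough sketch (not part of the original notebook), they could be folded into one function that accepts any number of keys. The name `first_present` and the toy rows are hypothetical. The choice to return NaN for any non-dict cell is also an assumption, since the original helpers only guard against nulls.

```python
import numpy as np
import pandas as pd


def first_present(x, *keys):
    """Return the value of the first key in `keys` found in dict `x`, else NaN.

    Hypothetical generalisation of func_onekey / func.
    """
    # Assumption: anything that is not a dict (NaN, None, a plain string) -> NaN
    if not isinstance(x, dict):
        return np.nan
    for key in keys:
        if key in x:
            return x[key]
    return np.nan


# Toy rows: newer key name, older key name, sub-key absent, and a null cell
demo = pd.Series([
    {'EOYAmt': '4808'},
    {'EOY': '120'},
    {'BOYAmt': '8956'},
    np.nan,
])
parsed = demo.apply(first_present, args=('EOYAmt', 'EOY'))
print(parsed.tolist())
```

Because non-dict cells become NaN in this version, it should only be applied once per column; re-running it on an already parsed column would blank the values.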

In [95]:
new_variables_df[new_variables_df['len_subkeys']!=2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,len_subkeys
0,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,1,1
41,F9_10_PC_LOANS_FROM_OFFICERS_EOY,"[LoansFromOfficersDirectorsGrp, LoansFromOfficersDirectors]",[EOYAmt],USAmountType,2,1


##### Deal with *Filer* separately -- I only want the state for now

In [96]:
df[:1][['Filer']]

Unnamed: 0,Filer
0,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300..."


In [97]:
%%time
df['F9_00_HD_FILER_STATE_US'] = df['Filer'][:].apply(func_onekey, key1='USAddress')

Wall time: 14.6 s


In [98]:
%%time
df['F9_00_HD_FILER_STATE_US'] = df['F9_00_HD_FILER_STATE_US'][:].apply(func, key1='State', key2='StateAbbreviationCd')

Wall time: 13.9 s


In [99]:
%%time
print(len(df[df['Filer'].notnull()]))
print(len(df[df['F9_00_HD_FILER_STATE_US'].notnull()]))

1895016
1893438
Wall time: 33.4 s


In [100]:
1895016-1893438

1578

<br>The rows with a missing state have a foreign address instead of a US address

In [118]:
df[(df['Filer'].notnull())&(df['F9_00_HD_FILER_STATE_US'].isnull())][['Filer']].sample(5)

Unnamed: 0,Filer
814179,"{'EIN': '980050753', 'BusinessName': {'BusinessNameLine1Txt': 'Children's Aid Foundation'}, 'InCareOfNm': '% ENZA DIBENEDETTO', 'BusinessNameControlTxt': 'CHIL', 'PhoneNum': '4169230924', 'ForeignAddress': {'AddressLine1Txt': '25 Spadina Road', '..."
229996,"{'EIN': '660498051', 'Name': {'BusinessNameLine1': 'JOHN DEWEY COLLEGE INC'}, 'NameControl': 'JOHN', 'Phone': '7877530039', 'ForeignAddress': {'AddressLine1': 'PO BOX 19538', 'City': 'SAN JUAN', 'ProvinceOrState': 'PR', 'Country': 'RQ', 'PostalCo..."
36494,"{'EIN': '510616171', 'Name': {'BusinessNameLine1': 'DIAMOND DEVELOPMENT INITIATIVE INTERNATIONAL'}, 'NameControl': 'DIAM', 'Phone': '6135650507', 'ForeignAddress': {'AddressLine1': '1 NICHOLAS STREET NO 1516 A', 'City': 'OTTAWA', 'ProvinceOrState..."
35243,"{'EIN': '660428488', 'Name': {'BusinessNameLine1': 'RINCON HEALTH CENTER INC'}, 'NameControl': 'RINC', 'Phone': '7878235555', 'ForeignAddress': {'AddressLine1': 'PO BOX 638', 'City': 'RINCON', 'ProvinceOrState': 'PR', 'Country': 'RQ', 'PostalCode..."
12695,"{'EIN': '133085180', 'Name': {'BusinessNameLine1': 'THE ROYAL SHAKESPEARE company Stratford-upon-Avon'}, 'NameControl': 'ROYA', 'Phone': '1789296655', 'ForeignAddress': {'AddressLine1': 'southern laneSTrATFORD-UPON-AVON', 'City': 'WARWICKSHIRE un..."
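
One way to check that the gap is explained by foreign addresses is to tabulate which address key each *Filer* dictionary carries. The sketch below runs on three made-up dictionaries rather than the real data, and `address_kind` is a hypothetical helper; only the key names (`USAddress`, `ForeignAddress`) come from the samples above.

```python
import numpy as np
import pandas as pd


def address_kind(filer):
    """Label a Filer dict by the address key it carries (hypothetical helper)."""
    if not isinstance(filer, dict):
        return np.nan
    if 'USAddress' in filer:
        return 'USAddress'
    if 'ForeignAddress' in filer:
        return 'ForeignAddress'
    return 'neither'


# Made-up stand-ins for the Filer column
filers = pd.Series([
    {'EIN': '000000001', 'USAddress': {'State': 'PA'}},
    {'EIN': '000000002', 'USAddress': {'StateAbbreviationCd': 'OH'}},
    {'EIN': '000000003', 'ForeignAddress': {'ProvinceOrState': 'PR', 'Country': 'RQ'}},
])
kinds = filers.apply(address_kind)
print(kinds.value_counts())
```

If the explanation holds, the `ForeignAddress` count on the full data would be expected to equal the 1,578 difference computed above; that has not been verified here.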


<br>Drop *F9_00_HD_FILER_STATE_US* from *new_variables_df*

In [120]:
new_variables_df = new_variables_df.drop(0) 

##### Now do *F9_10_PC_LOANS_FROM_OFFICERS_EOY*

#### Loop and apply 'one key' function

In [123]:
new_variables_df[new_variables_df['len_subkeys']!=2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,len_subkeys
41,F9_10_PC_LOANS_FROM_OFFICERS_EOY,"[LoansFromOfficersDirectorsGrp, LoansFromOfficersDirectors]",[EOYAmt],USAmountType,2,1


In [124]:
import timeit
start_time = timeit.default_timer()

# Variables with a single sub-key: pull that key out of each dictionary
for index, row in new_variables_df[new_variables_df['len_subkeys']!=2].iterrows():
    variable = row['variable_name_new']
    key = row['sub_keys'][0]
    print(variable, key)

    df[variable] = df[variable].apply(func_onekey, key1=key)
    # The astype('float') conversion is intentionally skipped here
    #df[variable] = df[variable].astype('float')

elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60)

F9_10_PC_LOANS_FROM_OFFICERS_EOY EOYAmt
# of minutes:  0.046058158333350245


In [126]:
df[['F9_10_PC_LOANS_FROM_OFFICERS_EOY']].sample(10)

Unnamed: 0,F9_10_PC_LOANS_FROM_OFFICERS_EOY
1793679,
1433690,
344811,
554515,
1527621,20000.0
1115247,0.0
1591731,0.0
1516172,
231469,
1068906,


In [129]:
df[['F9_10_PC_LOANS_FROM_OFFICERS_EOY']].dtypes

F9_10_PC_LOANS_FROM_OFFICERS_EOY    object
dtype: object
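
The column is still reported as `object` dtype after parsing. The numeric conversion is not done in this cell; as a hedged sketch of what it might look like, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN instead of raising. The sample values are invented.

```python
import numpy as np
import pandas as pd

# Invented sample mimicking a parsed sub-key column (strings plus NaN)
loans = pd.Series(['20000', '0', np.nan, 'not a number'], dtype='object')

# errors='coerce' maps anything unparseable to NaN rather than raising
loans_num = pd.to_numeric(loans, errors='coerce')
print(loans_num.dtype)
print(loans_num.tolist())
```

Compared with `astype('float')`, this does not stop on the first bad value, which may or may not be what is wanted: silently coerced values are easy to miss, so counting nulls before and after is a sensible check.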

#### Loop and apply main function

##### Sidebar -- deal with one sub-key variable that was missed 
I added the *BOY* version as well.

In [14]:
#df['F9_10_PC_UNSECURED_NOTES_BOY'] = df['F9_10_PC_UNSECURED_NOTES_EOY'] 

In [90]:
"""
import timeit
start_time = timeit.default_timer()

for index, row in new_variables_df[new_variables_df['variable_name_new'].str.contains('F9_10_PC_UNSECURED_NOTES')].iterrows():
    variable = row['variable_name_new']
    keys = row['sub_keys']
    key1 = keys[0]
    key2 = keys[1]
    print(variable, key1, key2)
    #print(type(row['variable_name_new']))
    #df.loc[df.index[index], row['variable_name_new']] = 
    #df.loc[df.index[45], 'reptrak100-rank-2013 (binary)'] = 0
    
    df[variable] = df[variable][:].apply(func, key1=key1, key2=key2)
    df[variable] = df[variable].astype('float')
    
    
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 
"""

F9_10_PC_UNSECURED_NOTES_BOY BOY BOYAmt
F9_10_PC_UNSECURED_NOTES_EOY EOY EOYAmt
# of minutes:  0.17385579333349596


#### Drop *F9_00_HD_SIGNING_OFFICER_SIGNTR* -- it's already dealt with

In [131]:
#df[['F9_00_HD_SIGNING_OFFICER_SIGNTR']].sample(5)

Unnamed: 0,F9_00_HD_SIGNING_OFFICER_SIGNTR
214857,"{'Name': 'Debbi Logan', 'Title': 'Executive Direc', 'DateSigned': '2012-11-08', 'AuthorizeThirdParty': 'true'}"
1225557,"{'PersonNm': 'JENNIFER BARR', 'PersonTitleTxt': 'CO-PRESIDENT ELECT', 'PhoneNum': '3037514355', 'SignatureDt': '2017-08-07', 'DiscussWithPaidPreparerInd': 'true'}"
183663,"{'Name': 'JEFF MONSON', 'Title': 'Executive Direc', 'DateSigned': '2012-08-14', 'AuthorizeThirdParty': 'true'}"
437170,"{'Name': 'STEVE BECKER', 'Title': 'TREASURER', 'Phone': '2086856989', 'DateSigned': '2013-10-17', 'AuthorizeThirdParty': '1'}"
316804,"{'Name': 'ARNOLD WITTE', 'Title': 'TREASURER', 'DateSigned': '2013-05-13', 'AuthorizeThirdParty': 'false'}"


In [88]:
#df[['F9_00_HD_SIGNING_OFFICER_SIGNTR']].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_SIGNING_OFFICER_SIGNTR,1895016,4350,2019-11-15,12645


In [132]:
new_variables_df[:2]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,len_subkeys
1,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[DateSigned, SignatureDt]",DateType,2,2
2,F9_08_PC_TOTAL_REVENUE,"[TotalRevenue, TotalRevenueGrp]","[TotalRevenueColumnAmt, TotalRevenueColumn]",USAmountType,2,2


In [91]:
#new_variables_df = new_variables_df.drop(1)
#new_variables_df[:2]

Unnamed: 0,variable_name_new,original_names,sub_keys,len,len_subkeys
2,F9_08_PC_TOTAL_REVENUE,"[TotalRevenue, TotalRevenueGrp]","[TotalRevenueColumnAmt, TotalRevenueColumn]",2,2
3,F9_09_PC_COMP_DISQUAL_FUNDRAISE,"[CompDisqualPersons, CompDisqualPersonsGrp]","[Fundraising, FundraisingAmt]",2,2


# 12/6/2020 - I'm commenting out the 'float' command here -- I'll deal with all data type changes in the next notebook

In [133]:
import timeit
start_time = timeit.default_timer()

# Variables with two candidate sub-key names: try the first, then the second
for index, row in new_variables_df[new_variables_df['len_subkeys']==2].iterrows():
    variable = row['variable_name_new']
    keys = row['sub_keys']
    key1 = keys[0]
    key2 = keys[1]
    print(variable, key1, key2)

    df[variable] = df[variable].apply(func, key1=key1, key2=key2)
    # The astype('float') conversion is intentionally skipped here
    #df[variable] = df[variable].astype('float')

elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60)

F9_00_HD_SIGNING_OFFICER_SIGNTR DateSigned SignatureDt
F9_08_PC_TOTAL_REVENUE TotalRevenueColumnAmt TotalRevenueColumn
F9_09_PC_COMP_DISQUAL_FUNDRAISE FundraisingAmt Fundraising
F9_09_PC_COMP_DISQUAL_MGMT ManagementAndGeneralAmt ManagementAndGeneral
F9_09_PC_COMP_DISQUAL_PROG_SVCE ProgramServicesAmt ProgramServices
F9_09_PC_COMP_DISQUAL_TOTAL TotalAmt Total
F9_09_PC_COMP_OFFICERS_FUNDRAISE FundraisingAmt Fundraising
F9_09_PC_COMP_OFFICERS_MGMT ManagementAndGeneralAmt ManagementAndGeneral
F9_09_PC_COMP_OFFICERS_PROG_SVCE ProgramServices ProgramServicesAmt
F9_09_PC_COMP_OFFICERS_TOTAL TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_ACCT_TOT TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_FR_TOT TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_INVST_TOT TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_LEGL_TOT TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_LOBB_TOT TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_MGMT_TOT TotalAmt Total
F9_09_PC_FEES_FOR_SVCE_OTH_TOT TotalAmt Total
F9_09_PC_OTHER_EMP_BEN_FUNDRAISE FundraisingAmt Fundraising
F9_09_PC

In [None]:
new_variables_df[:1]

In [134]:
subkey_vars = new_variables_df[new_variables_df['sub_keys'].notnull()]['variable_name_new'].tolist()
print(len(subkey_vars))

49


In [135]:
print(len(df.columns), len(df))
print(len(df[subkey_vars].columns), len(df))
df[subkey_vars].sample(10)

200 1895016
49 1895016


Unnamed: 0,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY
758717,2015-05-04,368867,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,479916,0,0,479916,,22109.0,,,678589.0,820211.0,,,,,,1498800
1350750,2018-04-13,901990,,,,0.0,,,,0.0,25900.0,0.0,0.0,8717.0,0.0,6655.0,198665.0,,,,0.0,,,,0.0,,,,0.0,,,,0.0,672678,13268,175066,484344,0.0,79397.0,82760.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,14495628
1476874,2018-11-15,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,,,0.0,,,,,,0
1526644,2018-11-13,532306,,,,,,,,78795.0,1250.0,,,,,,1450.0,,,,51173.0,,,,59452.0,,,,10713.0,,,,6082.0,525228,0,0,0,,1425.0,3131.0,,,,40000.0,40000.0,,,,43131
852154,2016-01-27,132712,,,,,,,,,5250.0,,,650.0,,,,,,,,,,,,,,,,,,,,152684,6756,7198,138730,,124747.0,100809.0,,,,17650.0,17641.0,,,,118450
263536,2013-08-12,38700,,,,,0.0,11334.0,0.0,11334.0,,,,25.0,,,,,,,,,,,,0.0,3639.0,0.0,3639.0,,,,,34412,0,31217,3195,,72022.0,118994.0,,,,131762.0,92576.0,,,,249932
892446,2016-02-29,220743,,,,,,,,,2251.0,,,851.0,,,,,,,,,,,,,,,,,,,,226745,0,3102,223643,,925.0,925.0,,,466591.0,104780.0,135681.0,,,,487168
1380292,2018-05-24,371710,,,,,,,,,2350.0,,,,,,1635.0,,,,27506.0,,,,167766.0,,,,,,,,,293670,0,0,0,,71326.0,149366.0,,,,,,,,,149366
1652340,2019-07-15,580712,,,,,,,42394.0,42394.0,2053.0,,,,,,,,,-1013.0,-1013.0,,,341829.0,341829.0,,,29403.0,29403.0,,,,,593720,0,0,593720,,12.0,12.0,,5743.0,,34069.0,18264.0,,,,21268
1181529,2017-09-05,3808013,,,,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60570.0,,,,0.0,,185482.0,73924.0,259406.0,,,,0.0,,,,0.0,4123730,0,441833,3681897,,123042.0,224285.0,,643202.0,,,0.0,,,,960997


In [136]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
501c3,1895016.0,0.757498,0.428597,0.0,1.0,1.0,1.0,1.0


### Look at 501(c)(3)s

In [137]:
print('# of columns:', len(df.columns))
print('# of observations:', len(df))

# of columns: 200
# of observations: 1895016


In [138]:
df['501c3'].value_counts()

1    1435470
0     459546
Name: 501c3, dtype: int64

In [139]:
print(len(df[df['501c3']==1]))

1435470


#### Create and save list of EINs for BMF File
In the previous round I believe there were only 296,334 EINs (though that may have been only for valid BMF EINs).

In [140]:
ein_list = df[df['501c3']==1]['EIN'].tolist()
print(len(ein_list))
print(len(set(ein_list)))
ein_list = list(set(ein_list))
print(len(ein_list))

1435470
257049
257049
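
`list(set(...))` removes the duplicates but does not guarantee the same ordering from one run to the next. If a reproducible file matters, one alternative sketch is to let pandas deduplicate and then sort. The EINs and flags below are made up.

```python
import pandas as pd

# Made-up filings: the first EIN appears twice (two fiscal years)
toy = pd.DataFrame({
    'EIN': ['000000002', '000000001', '000000002', '000000003'],
    '501c3': [1, 1, 1, 0],
})

# Keep 501(c)(3) filings, drop repeated EINs, and sort for a stable order
ein_list = sorted(toy.loc[toy['501c3'] == 1, 'EIN'].drop_duplicates())
print(len(ein_list), ein_list)
```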


In [11]:
import json
with open('ein_list_501c3.json', 'w') as fp:
    json.dump(ein_list, fp)

#### Save DF

In [141]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl')

Wall time: 2min 49s


In [104]:
#%%time
#df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl.gz', compression='gzip')

Wall time: 20min 45s


In [5]:
#%%time
#df = pd.read_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl')
#print('# of columns:', len(df.columns))
#print('# of observations:', len(df))
#df[:2]

# of columns: 200
# of observations: 1895016
Wall time: 57.5 s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1.0,,,,,1,,,1473903,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0.0,1439340,1044925.0,638637.0,10,30447,1753405,243131,0.0,0,0.0,0,89152,193604,,2440859,881768,195892,0,0.0,450430,1075372,0,0.0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1.0,0,0,0,0,0.0,0,1439340.0,,,,,,,,,,,,,,,1439340,1000,,1473903.0,,,,,,,,,21675.0,,215.0,,,,,,,,,,,,,,,,,,,,1384751.0,195892.0,145115.0,1043744.0,,,,256845,86228,,1,,240077.0,,332660.0,270700.0,,,,2440859.0,,,,89152,,0,1,,,1,,0,1,1,PA
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,0.0,2011,"{'EIN': '581805618', 'Name': {'BusinessNameLine1': 'TORRINGTON VOA ELDERLY HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}, 'NameControl': 'TORR', 'Phone': '7033415000', 'USAddress': {'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA'...",2016-02-24 21:20:13Z,,,,,,1,,1736.0,266420,0,0.0,,,2011-11-09,,WY,2011-06-30,2010,2011-11-09T07:32:06-08:00,,1,,,,,1993,,0,,,13,1425,1437850,189785,,0,,222839,-39085,-36926,,1433342,261190,0,0,,34577,224264,0,,19,0,0,828,1398765,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,222550,0,265592,82955,71405,1455332,305505,17482,266420,0,0,0,1,1,1,0,1,1,1,1,0,1,1,,1,0,0.0,0,0,0,1,1,1,13,19,0,1,,,0.0,,0,0,1,,0,0,1,411648,,1180355,,,,,,,,,,,,,,265592.0,,0,0,265592.0,266420.0,,,,0.0,,,,0.0,7500.0,0.0,0.0,0.0,21600.0,0.0,,,17714.0,17714.0,,,59440.0,59440.0,,,5801.0,5801.0,,,,0.0,305505.0,0.0,29100.0,276405.0,,250.0,22261.0,2187206,904332,,1,,11349.0,,,0.0,7035.0,,,1433342.0,,,,-39085,,0,1,,,1,1.0,1,1,1,VA


# Ended here 12/6/2020

# Change dtypes

In [105]:
df.dtypes

OrganizationName                    object
URL                                 object
DLN                                 object
TaxPeriod                           object
EIN                                 object
                                     ...  
F9_12_PC_FED_GRNT_AUDIT_PERFORMD    object
F9_12_PC_FED_GRNT_AUDIT_REQUIRED    object
F9_12_PC_FINCL_STMTS_AUDITED        object
501c3                                int32
F9_00_HD_FILER_STATE_US             object
Length: 200, dtype: object

In [12]:
string_cols = df.select_dtypes(include='object').columns.tolist()
print(len(string_cols))
print(string_cols, '\n')

102
['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'fiscal_year', 'Filer', 'F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP

In [9]:
[c for c in string_cols if 'Filer' in c]

['Filer']

In [13]:
string_cols.remove('Filer')

In [11]:
[c for c in string_cols if 'Filer' in c]

[]

In [12]:
df[string_cols].describe().T

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 1652, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'



Unnamed: 0,count,unique,top,freq
OrganizationName,1895015,402753,SHRINERS INTERNATIONAL,486
URL,1895015,1895015,https://s3.amazonaws.com/irs-form-990/201632669349300208_public.xml,1
DLN,1895015,1894870,93493146006000,2
TaxPeriod,1895015,130,201712,138562
EIN,1895015,336694,943041314,18
...,...,...,...,...
F9_12_PC_AUDIT_COMMITTEE,1123853,2,1,904900
F9_12_PC_FED_GRNT_AUDIT_PERFORMD,234488,2,1,181201
F9_12_PC_FED_GRNT_AUDIT_REQUIRED,1718321,2,0,1536437
F9_12_PC_FINCL_STMTS_AUDITED,1895016,2,0,987613
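
The `unhashable type: 'list'` messages above suggest that at least one `object` column holds Python lists (a row shown earlier in this notebook contains the value `[PA, NJ, DE]`). A sketch for locating such columns follows; `columns_with_lists` is a hypothetical helper and the frame is a toy example.

```python
import pandas as pd


def columns_with_lists(frame):
    """Return names of columns holding at least one list (hypothetical helper)."""
    return [
        col for col in frame.columns
        if frame[col].map(lambda v: isinstance(v, list)).any()
    ]


# Toy frame: 'states' mixes lists and nulls, 'name' holds plain strings
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'states': [['PA', 'NJ', 'DE'], None, ['OH']],
})
print(columns_with_lists(toy))
```

Columns found this way could be left out of the `describe()` call or converted to strings first; on roughly 1.9 million rows the scan itself will take some time.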


In [13]:
df[string_cols[110:130]].describe().T

Unnamed: 0,count,unique,top,freq
F9_07_PC_TOTAL_COMP_GRTR_150K,1895016,2,0,1502446
F9_07_PC_TOT_OTHER_COMPENSATION,1155231,215325,0,550976
F9_07_PC_TOT_REPRT_COMP_FROM_ORG,1417627,410070,0,439049
F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,1049376,159712,0,820947
F9_08_PC_ALL_OTHER_CONTRIBUTIONS,1252275,585146,0,6465
F9_08_PC_CONTS_REPRTD_FNDRAISNG,315209,157487,0,26586
F9_08_PC_COST_OF_GOODS_SOLD,258222,131280,0,37629
F9_08_PC_FEDERATED_CAMPAIGNS,127613,62899,0,31627
F9_08_PC_FUNDRAISING_DIRECT_EXP,517853,157553,0,47255
F9_08_PC_FUNDRAISING_EVENTS,333970,159941,0,26360


In [52]:
df[df['F9_10_PC_UNSECURED_NOTES_EOY'].notnull()][['F9_10_PC_UNSECURED_NOTES_EOY', 
                                                  'F9_10_PC_UNSECURED_NOTES_BOY']].sample(5)

Unnamed: 0,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PC_UNSECURED_NOTES_BOY
1465201,0.0,0.0
385730,0.0,0.0
888517,2027865.0,1999750.0
789951,1009570.0,1106658.0
1414944,2085073.0,2183412.0


In [26]:
#### First identify all variables in *string_cols* that should be strings
#exclude_cols = ['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END',
#                'F9_01_PZ_ORGANIZATIONAL_MISSION',  'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 
#                'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED']
#problem_cols = ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS']

In [32]:
#df[df['F9_00_HD_INCLUDES_SUBORD_ORGS'].notnull()][['F9_00_HD_INCLUDES_SUBORD_ORGS']].sample(5)

Unnamed: 0,F9_00_HD_INCLUDES_SUBORD_ORGS
7881,False
6286,False
4683,False
9233,False
2756,False


#### First, convert columns that have zero missing values to integers

In [14]:
integer_cols_feb = ['DLN',
 'EIN',
 'OrganizationName',
 'TaxPeriod',
 'URL',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TAX_YEAR',
 'F9_00_HD_GROSS_RCPT',
 'F9_00_HD_GROUP_RETURN',
 'F9_01_PC_CONTR_GRANTS_CURR',
 'F9_01_PC_INDEP_VOTING_MEMB',
 'F9_01_PC_PROF_FUNDRISING_EXP_CURR',
 'F9_01_PC_REV_LESS_EXP_CURR',
 'F9_01_PC_TOT_ASSETS_EOY',
 'F9_01_PC_TOT_FNDR_EXP_CURR',
 'F9_01_PC_TOT_INDIV_EMPLOYED',
 'F9_01_PC_TOT_LIABILITIES_EOY',
 'F9_01_PC_TOT_UBI_GROSS',
 'F9_01_PC_VOTING_MEMB_GOV_BODY',
 'F9_01_PZ_BEN_PAID_TO_MEMB_CURR',
 'F9_01_PZ_GRANTS_PAID_CURR',
 'F9_01_PZ_INVEST_INCOME_CURR',
 'F9_01_PZ_NAFB_EOY',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'F9_01_PZ_OTHER_EXPENSE_CURR',
 'F9_01_PZ_OTHER_REV_CURR',
 'F9_01_PZ_PROG_SERVICE_REV_CURR',
 'F9_01_PZ_SALARIES_CURR',
 'F9_01_PZ_TOT_EXP_CURR',
 'F9_01_PZ_TOT_REV_CURR',
 'F9_06_PC_990_PROVIDED_GOV_BODY',
 'F9_06_PC_CHANGES_ORGANIZING_DOCS',
 'F9_06_PC_CONFLICT_OF_INTEREST',
 'F9_06_PC_DECISIONS_SUBJ_APPROVAL',
 'F9_06_PC_DELEGATION_MGT_DUTIES',
 'F9_06_PC_DELEGATION_OF_MGT',
 'F9_06_PC_DOCUMENT_RET_POLICY',
 'F9_06_PC_ELECTION_BOARD_MEMBERS',
 'F9_06_PC_FAMILY_OR_BUSINESS_REL',
 'F9_06_PC_JOINT_VENTURE_INVESTMNT',
 'F9_06_PC_LOCAL_CHAPTERS',
 'F9_06_PC_MATERIAL_DIVERSION',
 'F9_06_PC_MEMBERS_OR_STOCKHOLDERS',
 'F9_06_PC_MINUTES_GOVERNING_BODY',
 'F9_06_PC_NUM_IND_VOTING_MEMBERS',
 'F9_06_PC_NUM_VOTING_GOV_MEMBERS',
 'F9_06_PC_OFFICER_MAILING_ADDRESS',
 'F9_06_PC_WHISTLEBLOWER_POLICY',
 'F9_07_PC_COMPENSATION_OTHER_SRCE',
 'F9_07_PC_FORMER_OFFICER_LISTED',
 'F9_07_PC_TOTAL_COMP_GRTR_150K',
 'F9_12_PC_ACCNT_COMPILE_OR_REVIEW',
 'F9_12_PC_FINCL_STMTS_AUDITED']

## 12/5/2020 - I don't think this is the best way to go. Instead, see what each variable is and then add a column to the *concordance* file

In [15]:
integer_cols = []
for s in string_cols[:]:
    num_missing = len(df[df[s].isnull()])
    #print(num_missing)
    if num_missing == 0:
        #print ("yes")
        integer_cols.append(s)
integer_cols        

KeyboardInterrupt: 
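
The cell above was interrupted, probably because `len(df[df[s].isnull()])` builds a filtered copy of the whole frame for every column. A sketch of an equivalent check that avoids those copies is shown below on a toy frame; the column names are invented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df[string_cols]
toy = pd.DataFrame({
    'always_filled': ['1', '0', '1'],
    'sometimes_null': ['5', np.nan, '7'],
    'also_filled': ['a', 'b', 'c'],
}, dtype='object')

# notnull().all() yields one boolean per column: True when nothing is missing
no_missing = toy.notnull().all()
integer_cols = no_missing[no_missing].index.tolist()
print(integer_cols)
```

As in the original loop, "no missing values" does not by itself mean a column is numeric (the toy `also_filled` column is text), so the resulting list still needs to be filtered before any integer conversion.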

In [16]:
set(integer_cols_feb) - set(integer_cols)

{'DLN',
 'EIN',
 'F9_00_HD_GROSS_RCPT',
 'F9_00_HD_GROUP_RETURN',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TAX_YEAR',
 'F9_01_PC_CONTR_GRANTS_CURR',
 'F9_01_PC_INDEP_VOTING_MEMB',
 'F9_01_PC_PROF_FUNDRISING_EXP_CURR',
 'F9_01_PC_REV_LESS_EXP_CURR',
 'F9_01_PC_TOT_ASSETS_EOY',
 'F9_01_PC_TOT_FNDR_EXP_CURR',
 'F9_01_PC_TOT_INDIV_EMPLOYED',
 'F9_01_PC_TOT_LIABILITIES_EOY',
 'F9_01_PC_TOT_UBI_GROSS',
 'F9_01_PC_VOTING_MEMB_GOV_BODY',
 'F9_01_PZ_BEN_PAID_TO_MEMB_CURR',
 'F9_01_PZ_GRANTS_PAID_CURR',
 'F9_01_PZ_INVEST_INCOME_CURR',
 'F9_01_PZ_NAFB_EOY',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'F9_01_PZ_OTHER_EXPENSE_CURR',
 'F9_01_PZ_OTHER_REV_CURR',
 'F9_01_PZ_PROG_SERVICE_REV_CURR',
 'F9_01_PZ_SALARIES_CURR',
 'F9_01_PZ_TOT_EXP_CURR',
 'F9_01_PZ_TOT_REV_CURR',
 'F9_06_PC_990_PROVIDED_GOV_BODY',
 'F9_06_PC_CHANGES_ORGANIZING_DOCS',
 'F9_06_PC_CONFLICT_OF_INTEREST',
 'F9_06_PC_DECISIONS_SUBJ_APPROVAL',
 'F9_06_PC_DELEGATION_MGT_DUTIES',
 'F9_06_PC_DELEGATION_OF_MGT',
 'F9_06_PC_DOCUMENT_RET_POLICY',


In [19]:
set(integer_cols) - set(integer_cols_feb)

{'F9_00_HD_BUILD_TIME_STAMP',
 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
 'F9_00_HD_TIME_STAMP',
 'F9_04_PC_FR_EVENT_INC_GT_15K',
 'F9_04_PC_GAMING_INC_GT_15K',
 'F9_04_PC_PROF_FR_EXP_GT_15K'}

In [17]:
exclude_cols = ['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END',
                'F9_01_PZ_ORGANIZATIONAL_MISSION']
integer_cols = [col for col in integer_cols if col not in exclude_cols]
print(len(integer_cols))
print(integer_cols)

1
['F9_00_HD_BUILD_TIME_STAMP']


##### Save *string_cols* and *integer_cols*

In [20]:
import json
with open('string_cols.json', 'w') as fp:
    json.dump(string_cols, fp)
with open('integer_cols.json', 'w') as fp:
    json.dump(integer_cols, fp)    

In [18]:
import json
# Use a context manager so the file handle is closed after loading
with open('string_cols.json', 'r') as f:
    string_cols = json.load(f)
string_cols = [str(t) for t in string_cols]
print(len(string_cols))
print(string_cols)

150
['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'fiscal_year', 'F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_CURR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INDEP_VOTING_MEMB', 'F9_01_PC_INVEST_INCOME_PRIO

In [19]:
import json
# Use a context manager so the file handle is closed after loading
with open('integer_cols.json', 'r') as f:
    integer_cols = json.load(f)
integer_cols = [str(t) for t in integer_cols]
print(len(integer_cols))
print(integer_cols)

52
['F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_TIME_STAMP', 'F9_01_PC_CONTR_GRANTS_CURR', 'F9_01_PC_INDEP_VOTING_MEMB', 'F9_01_PC_PROF_FUNDRISING_EXP_CURR', 'F9_01_PC_REV_LESS_EXP_CURR', 'F9_01_PC_TOT_ASSETS_EOY', 'F9_01_PC_TOT_FNDR_EXP_CURR', 'F9_01_PC_TOT_INDIV_EMPLOYED', 'F9_01_PC_TOT_LIABILITIES_EOY', 'F9_01_PC_TOT_UBI_GROSS', 'F9_01_PC_VOTING_MEMB_GOV_BODY', 'F9_01_PZ_BEN_PAID_TO_MEMB_CURR', 'F9_01_PZ_GRANTS_PAID_CURR', 'F9_01_PZ_INVEST_INCOME_CURR', 'F9_01_PZ_NAFB_EOY', 'F9_01_PZ_OTHER_EXPENSE_CURR', 'F9_01_PZ_OTHER_REV_CURR', 'F9_01_PZ_PROG_SERVICE_REV_CURR', 'F9_01_PZ_SALARIES_CURR', 'F9_01_PZ_TOT_EXP_CURR', 'F9_01_PZ_TOT_REV_CURR', 'F9_04_PC_FR_EVENT_INC_GT_15K', 'F9_04_PC_GAMING_INC_GT_15K', 'F9_04_PC_PROF_FR_EXP_GT_15K', 'F9_06_PC_990_PROVIDED_GOV_BODY', 'F9_06_PC_CHANGES_ORGANIZING_DOCS', 'F9_06_PC_CONFLICT_OF_INTEREST', 'F9_06_PC_DECISIONS_SUBJ_APPROVAL', 'F9_06_PC_DELEGATIO

In [21]:
df[integer_cols].sample(10)

Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_FINCL_STMTS_AUDITED
874704,2016-08-17 19:52:53Z,251644,0,2016-05-12,2014,2016-05-12T13:36:01-05:00,0,214,0,28772,87205,0,0,0,0,214,0,8628,0,87205,210898,61025,187273,0,219526,248298,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,214,214,0,1,0,0,0,0,0
546303,2015-11-30 17:44:51Z,1474405,0,2014-11-12,2013,2014-12-15T09:56:38-06:00,0,5,0,109443,1639435,0,11,245367,0,5,0,0,509,1394068,273675,62033,1410617,1090041,1363716,1473159,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,5,5,0,1,0,0,1,0,1
12385,2016-02-24 21:20:13Z,18936707,0,2011-02-09,2009,2011-02-10T11:22:15-06:00,4822793,137,0,36926,8741797,161007,48,5935599,942770,137,0,0,122384,2806198,4804698,454559,3230424,3788536,8593234,8630160,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,1,1,137,137,0,0,0,0,1,0,1
1684616,2019-02-21 02:37:17Z,470794,0,2019-08-22,2018,2019-09-05T14:36:38-04:00,181476,9,0,51486,158546,50011,21,5237,0,9,0,0,1865,153309,133923,2113,285340,285385,419308,470794,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,9,9,0,1,0,0,0,0,0
1442072,2018-06-14 16:35:46Z,251166,0,2018-08-06,2017,2018-10-10T10:07:03-04:00,0,460,0,-12874,12834,0,11,0,0,460,0,0,0,12834,187110,200558,50608,76930,264040,251166,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,460,460,0,0,0,0,0,0,0
1797566,2020-04-17 16:48:07Z,41621,0,2019-11-12,2018,2019-11-14T10:22:40-07:00,41621,9,0,-14916,14507,5661,2,45,0,10,0,0,0,14462,31684,0,0,24853,56537,41621,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,9,10,0,1,0,0,0,0,0
1893758,2020-09-23 17:36:50Z,575982,0,2020-02-17,2018,2020-02-18T17:11:54-06:00,369336,15,0,-157526,603978,94502,14,13695,0,15,0,0,-614,590283,451610,96535,32213,203386,654996,497470,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,15,15,0,0,0,0,0,0,1
436467,2016-03-07 17:11:31Z,928071,0,2013-10-18,2012,2013-11-05T16:26:20-06:00,287421,42,0,36066,305973,0,8,170532,0,42,0,0,-4088,135441,613171,347580,286753,268429,881600,917666,0,0,0,1,0,1,1,0,0,1,1,1,0,1,0,1,1,42,42,0,1,0,0,0,0,0
1706386,2019-02-21 02:37:17Z,5975,0,2019-09-25,2018,2019-09-25T11:36:00-07:00,5975,0,0,-1739,0,0,0,0,0,0,0,0,0,0,5514,0,0,2200,7714,5975,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
898097,2016-08-17 19:52:53Z,2608013,0,2016-02-09,2014,2016-05-16T11:23:01-05:00,146343,8,0,358587,1814932,0,120,67425,0,8,0,0,5162,1747507,583857,29556,2419223,1657840,2241697,2600284,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,8,8,0,1,0,0,0,0,1


In [22]:
for c in integer_cols:
    print("df['%s'] = df['%s'].astype('int')" % (c, c))

df['F9_00_HD_BUILD_TIME_STAMP'] = df['F9_00_HD_BUILD_TIME_STAMP'].astype('int')
df['F9_00_HD_GROSS_RCPT'] = df['F9_00_HD_GROSS_RCPT'].astype('int')
df['F9_00_HD_GROUP_RETURN'] = df['F9_00_HD_GROUP_RETURN'].astype('int')
df['F9_00_HD_SIGNING_OFFICER_SIGNTR'] = df['F9_00_HD_SIGNING_OFFICER_SIGNTR'].astype('int')
df['F9_00_HD_TAX_YEAR'] = df['F9_00_HD_TAX_YEAR'].astype('int')
df['F9_00_HD_TIME_STAMP'] = df['F9_00_HD_TIME_STAMP'].astype('int')
df['F9_01_PC_CONTR_GRANTS_CURR'] = df['F9_01_PC_CONTR_GRANTS_CURR'].astype('int')
df['F9_01_PC_INDEP_VOTING_MEMB'] = df['F9_01_PC_INDEP_VOTING_MEMB'].astype('int')
df['F9_01_PC_PROF_FUNDRISING_EXP_CURR'] = df['F9_01_PC_PROF_FUNDRISING_EXP_CURR'].astype('int')
df['F9_01_PC_REV_LESS_EXP_CURR'] = df['F9_01_PC_REV_LESS_EXP_CURR'].astype('int')
df['F9_01_PC_TOT_ASSETS_EOY'] = df['F9_01_PC_TOT_ASSETS_EOY'].astype('int')
df['F9_01_PC_TOT_FNDR_EXP_CURR'] = df['F9_01_PC_TOT_FNDR_EXP_CURR'].astype('int')
df['F9_01_PC_TOT_INDIV_EMPLOYED'] = df['F9_01_PC_TOT_IND

#### Change numeric variables with no missing values to integer format
Columns that contain very large numbers raise this error on `astype('int')`: "OverflowError: Python int too large to convert to C long"

I converted those columns to 'float' instead -- see https://stackoverflow.com/questions/38314118/overflowerror-python-int-too-large-to-convert-to-c-long-on-windows-but-not-ma
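
A possible alternative to falling back on float is sketched here. It assumes the error comes from NumPy's default integer being a 32-bit C long on this Windows setup, which would be consistent with the `int32` dtypes this notebook reports for the converted columns. The toy Series and its values are made up for illustration; the notebook itself uses 'float'.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one entry exceeds the 32-bit integer range
s = pd.Series([251644, 3000000000, -12874], dtype='object')

# Largest value a 32-bit integer can hold
print(np.iinfo(np.int32).max)

# Requesting int64 by name avoids depending on the platform's default 'int'
s_int64 = s.astype('int64')
print(s_int64.dtype)
```

An explicit `int64` keeps integer values exact, whereas float64 represents integers exactly only up to 2**53.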

In [31]:
import sys
sys.maxsize

9223372036854775807

In [32]:
df[integer_cols].dtypes

F9_00_HD_BUILD_TIME_STAMP            object
F9_00_HD_GROSS_RCPT                  object
F9_00_HD_GROUP_RETURN                 int32
F9_00_HD_SIGNING_OFFICER_SIGNTR      object
F9_00_HD_TAX_YEAR                     int32
F9_00_HD_TIME_STAMP                  object
F9_01_PC_CONTR_GRANTS_CURR           object
F9_01_PC_INDEP_VOTING_MEMB            int32
F9_01_PC_PROF_FUNDRISING_EXP_CURR     int32
F9_01_PC_REV_LESS_EXP_CURR           object
F9_01_PC_TOT_ASSETS_EOY              object
F9_01_PC_TOT_FNDR_EXP_CURR           object
F9_01_PC_TOT_INDIV_EMPLOYED          object
F9_01_PC_TOT_LIABILITIES_EOY         object
F9_01_PC_TOT_UBI_GROSS               object
F9_01_PC_VOTING_MEMB_GOV_BODY        object
F9_01_PZ_BEN_PAID_TO_MEMB_CURR       object
F9_01_PZ_GRANTS_PAID_CURR            object
F9_01_PZ_INVEST_INCOME_CURR          object
F9_01_PZ_NAFB_EOY                    object
F9_01_PZ_OTHER_EXPENSE_CURR          object
F9_01_PZ_OTHER_REV_CURR              object
F9_01_PZ_PROG_SERVICE_REV_CURR  

In [30]:
df[integer_cols].sample(10)

Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_FINCL_STMTS_AUDITED
1054995,2017-02-10 21:41:12Z,8845668,0,2016-11-15,2015,2016-11-15T13:58:30-06:00,6671708,13,0,-1639094,6977755,1157186,167,2614012,0,13,0,3084198,117878,4363743,3933470,23704,1630319,3065035,10082703,8443609,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,13,13,0,1,0,0,1,0,1
52860,2016-02-24 21:20:13Z,260644,0,2011-07-29,2010,2011-07-29T12:22:30-07:00,0,0,0,-80887,547503,0,0,277553,0,7,0,0,0,269950,0,-80887,0,0,0,-80887,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,1,0,7,0,1,0,0,0,0,1
381979,2016-02-24 21:20:13Z,1691151,0,2013-05-13,2011,2013-05-15T12:36:12-05:00,1690797,0,0,94,38566,0,29,56119,0,2,0,0,354,-17553,371203,0,0,1319854,1691057,1691151,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,2,0,0,0,0,1,0,1
552614,2015-11-30 17:44:51Z,187536,0,2014-10-27,2013,2014-10-19T12:22:19-07:00,155311,4,0,28647,590518,0,2,891988,0,4,0,0,423,-301470,138099,0,31802,20790,158889,187536,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,1,4,4,0,0,0,0,0,0,1
1848796,2020-04-17 16:48:07Z,2428398,0,2019-11-13,2018,2019-11-15T08:29:05-06:00,945231,3,0,373998,988647,5169,15,92986,0,6,0,7110,1422,895661,1370845,29251,1452494,676445,2054400,2428398,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,1,3,6,0,0,0,0,0,0,1
1424083,2018-06-14 16:35:46Z,542590,0,2018-07-30,2017,2018-08-02T11:56:37-04:00,379613,11,0,12183,134307,45388,20,16764,829,11,0,0,0,117543,238710,14253,135955,278928,517638,529821,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,11,11,0,1,0,0,0,1,0
247070,2016-02-24 21:20:13Z,2058390,0,2012-09-20,2011,2012-10-09T13:10:33-05:00,969511,19,0,238141,1877419,0,7,370560,0,19,0,26500,19703,1506859,880263,12910,878330,735550,1642313,1880454,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,0,1,19,19,0,1,0,0,1,0,1
855578,2016-04-25 22:37:26Z,5069401,0,2016-01-31,2014,2016-02-10T12:13:35-06:00,4967313,17,0,177006,962125,176362,65,361320,0,17,0,1028626,694,600805,1037227,19995,45877,2791020,4856873,5033879,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,17,17,0,1,0,0,1,0,1
1158482,2017-02-10 21:41:12Z,107175,0,2017-05-11,2016,2017-05-11T10:49:13-00:00,30000,12,0,7346,2110203,0,0,323573,0,12,0,0,296,1786630,32967,10017,0,0,32967,40313,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,12,12,0,0,0,0,0,0,0
636154,2016-03-07 17:11:31Z,3491241,0,2014-12-06,2013,2015-01-12T13:49:05-06:00,21873,10,0,38718,5675074,3866,65,2004133,0,11,0,15500,17136,3670941,750227,0,3156280,2390844,3156571,3195289,0,0,0,1,0,1,0,1,1,1,0,0,0,0,0,0,1,10,11,0,1,0,0,1,0,1


##### 12/4/2020 - Exclude four variables from *integer_cols*

In [20]:
exclude_cols_part2 = ['F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_TAX_YEAR',
                     'F9_00_HD_TIME_STAMP']

In [21]:
print(len(integer_cols))
integer_cols = [c for c in integer_cols if c not in exclude_cols_part2]
print(len(integer_cols))

52
48


In [40]:
df['F9_00_HD_GROSS_RCPT'] = df['F9_00_HD_GROSS_RCPT'].astype('float')
df['F9_00_HD_GROUP_RETURN'] = df['F9_00_HD_GROUP_RETURN'].astype('int')
df['F9_01_PC_CONTR_GRANTS_CURR'] = df['F9_01_PC_CONTR_GRANTS_CURR'].astype('float')
df['F9_01_PC_INDEP_VOTING_MEMB'] = df['F9_01_PC_INDEP_VOTING_MEMB'].astype('int')
df['F9_01_PC_PROF_FUNDRISING_EXP_CURR'] = df['F9_01_PC_PROF_FUNDRISING_EXP_CURR'].astype('int')
df['F9_01_PC_REV_LESS_EXP_CURR'] = df['F9_01_PC_REV_LESS_EXP_CURR'].astype('float')
df['F9_01_PC_TOT_ASSETS_EOY'] = df['F9_01_PC_TOT_ASSETS_EOY'].astype('float')
df['F9_01_PC_TOT_FNDR_EXP_CURR'] = df['F9_01_PC_TOT_FNDR_EXP_CURR'].astype('int')
df['F9_01_PC_TOT_INDIV_EMPLOYED'] = df['F9_01_PC_TOT_INDIV_EMPLOYED'].astype('int')
df['F9_01_PC_TOT_LIABILITIES_EOY'] = df['F9_01_PC_TOT_LIABILITIES_EOY'].astype('float')
df['F9_01_PC_TOT_UBI_GROSS'] = df['F9_01_PC_TOT_UBI_GROSS'].astype('int')
df['F9_01_PC_VOTING_MEMB_GOV_BODY'] = df['F9_01_PC_VOTING_MEMB_GOV_BODY'].astype('int')
df['F9_01_PZ_BEN_PAID_TO_MEMB_CURR'] = df['F9_01_PZ_BEN_PAID_TO_MEMB_CURR'].astype('float')
df['F9_01_PZ_GRANTS_PAID_CURR'] = df['F9_01_PZ_GRANTS_PAID_CURR'].astype('float')
df['F9_01_PZ_INVEST_INCOME_CURR'] = df['F9_01_PZ_INVEST_INCOME_CURR'].astype('float')
df['F9_01_PZ_NAFB_EOY'] = df['F9_01_PZ_NAFB_EOY'].astype('float')
df['F9_01_PZ_OTHER_EXPENSE_CURR'] = df['F9_01_PZ_OTHER_EXPENSE_CURR'].astype('float')
df['F9_01_PZ_OTHER_REV_CURR'] = df['F9_01_PZ_OTHER_REV_CURR'].astype('int')
df['F9_01_PZ_PROG_SERVICE_REV_CURR'] = df['F9_01_PZ_PROG_SERVICE_REV_CURR'].astype('float')
df['F9_01_PZ_SALARIES_CURR'] = df['F9_01_PZ_SALARIES_CURR'].astype('float')
df['F9_01_PZ_TOT_EXP_CURR'] = df['F9_01_PZ_TOT_EXP_CURR'].astype('float')
df['F9_01_PZ_TOT_REV_CURR'] = df['F9_01_PZ_TOT_REV_CURR'].astype('float')
df['F9_04_PC_FR_EVENT_INC_GT_15K'] = df['F9_04_PC_FR_EVENT_INC_GT_15K'].astype('int')
df['F9_04_PC_GAMING_INC_GT_15K'] = df['F9_04_PC_GAMING_INC_GT_15K'].astype('int')
df['F9_04_PC_PROF_FR_EXP_GT_15K'] = df['F9_04_PC_PROF_FR_EXP_GT_15K'].astype('int')
df['F9_06_PC_990_PROVIDED_GOV_BODY'] = df['F9_06_PC_990_PROVIDED_GOV_BODY'].astype('int')
df['F9_06_PC_CHANGES_ORGANIZING_DOCS'] = df['F9_06_PC_CHANGES_ORGANIZING_DOCS'].astype('int')
df['F9_06_PC_CONFLICT_OF_INTEREST'] = df['F9_06_PC_CONFLICT_OF_INTEREST'].astype('int')
df['F9_06_PC_DECISIONS_SUBJ_APPROVAL'] = df['F9_06_PC_DECISIONS_SUBJ_APPROVAL'].astype('int')
df['F9_06_PC_DELEGATION_MGT_DUTIES'] = df['F9_06_PC_DELEGATION_MGT_DUTIES'].astype('int')
df['F9_06_PC_DELEGATION_OF_MGT'] = df['F9_06_PC_DELEGATION_OF_MGT'].astype('int')
df['F9_06_PC_DOCUMENT_RET_POLICY'] = df['F9_06_PC_DOCUMENT_RET_POLICY'].astype('int')
df['F9_06_PC_ELECTION_BOARD_MEMBERS'] = df['F9_06_PC_ELECTION_BOARD_MEMBERS'].astype('int')
df['F9_06_PC_FAMILY_OR_BUSINESS_REL'] = df['F9_06_PC_FAMILY_OR_BUSINESS_REL'].astype('int')
df['F9_06_PC_JOINT_VENTURE_INVESTMNT'] = df['F9_06_PC_JOINT_VENTURE_INVESTMNT'].astype('int')
df['F9_06_PC_LOCAL_CHAPTERS'] = df['F9_06_PC_LOCAL_CHAPTERS'].astype('int')
df['F9_06_PC_MATERIAL_DIVERSION'] = df['F9_06_PC_MATERIAL_DIVERSION'].astype('int')
df['F9_06_PC_MEMBERS_OR_STOCKHOLDERS'] = df['F9_06_PC_MEMBERS_OR_STOCKHOLDERS'].astype('int')
df['F9_06_PC_MINUTES_GOVERNING_BODY'] = df['F9_06_PC_MINUTES_GOVERNING_BODY'].astype('int')
df['F9_06_PC_NUM_IND_VOTING_MEMBERS'] = df['F9_06_PC_NUM_IND_VOTING_MEMBERS'].astype('int')
df['F9_06_PC_NUM_VOTING_GOV_MEMBERS'] = df['F9_06_PC_NUM_VOTING_GOV_MEMBERS'].astype('int')
df['F9_06_PC_OFFICER_MAILING_ADDRESS'] = df['F9_06_PC_OFFICER_MAILING_ADDRESS'].astype('int')
df['F9_06_PC_WHISTLEBLOWER_POLICY'] = df['F9_06_PC_WHISTLEBLOWER_POLICY'].astype('int')
df['F9_07_PC_COMPENSATION_OTHER_SRCE'] = df['F9_07_PC_COMPENSATION_OTHER_SRCE'].astype('int')
df['F9_07_PC_FORMER_OFFICER_LISTED'] = df['F9_07_PC_FORMER_OFFICER_LISTED'].astype('int')
df['F9_07_PC_TOTAL_COMP_GRTR_150K'] = df['F9_07_PC_TOTAL_COMP_GRTR_150K'].astype('int')
df['F9_12_PC_ACCNT_COMPILE_OR_REVIEW'] = df['F9_12_PC_ACCNT_COMPILE_OR_REVIEW'].astype('int')
df['F9_12_PC_FINCL_STMTS_AUDITED'] = df['F9_12_PC_FINCL_STMTS_AUDITED'].astype('int')
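
The one-line-per-column conversions above could also be written as a single `astype` call with a column-to-dtype mapping. This is only a sketch on a toy frame: `toy` and `float_overrides` are hypothetical names, and the values are made up.

```python
import pandas as pd

# Toy frame standing in for df; object columns holding made-up integers
toy = pd.DataFrame({
    'F9_00_HD_GROSS_RCPT': [251644, 1474405],
    'F9_00_HD_GROUP_RETURN': [0, 0],
}, dtype='object')

# Hypothetical list of columns that should become float rather than int
float_overrides = ['F9_00_HD_GROSS_RCPT']

# Build {column: dtype} and convert every column in one call
dtype_map = {c: ('float' if c in float_overrides else 'int') for c in toy.columns}
toy = toy.astype(dtype_map)
print(toy.dtypes)
```

Keeping the overrides in one list makes it easier to see which columns were switched to float and why.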

#### Save DF

In [41]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl')

Wall time: 1min 35s


In [6]:
#%%time
#df = pd.read_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl')
#print('# of columns:', len(df.columns))
#print('# of observations:', len(df))
#df[:2]

# of columns: 200
# of observations: 1895016
Wall time: 43.7 s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1.0,,,,,1,,,1473903.0,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0.0,1439340.0,1044925.0,638637.0,10,30447,1753405,243131,0.0,0,0.0,0,89152.0,193604,,2440859.0,881768,195892,0,0.0,450430.0,1075372,0,0.0,10,0.0,925000.0,33563.0,1990429.0,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751.0,1000,0.0,0.0,0,1925215,1384751.0,171810,1473903.0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1.0,0,0,0,0,0.0,0,1439340.0,,,,,,,,,,,,,,,1439340,1000,,1473903.0,,,,,,,,,21675.0,,215.0,,,,,,,,,,,,,,,,,,,,1384751.0,195892.0,145115.0,1043744.0,,,,256845,86228,,1,,240077.0,,332660.0,270700.0,,,,2440859.0,,,,89152,,0,1,,,1,,0,1,1,PA
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,0.0,2011,"{'EIN': '581805618', 'Name': {'BusinessNameLine1': 'TORRINGTON VOA ELDERLY HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}, 'NameControl': 'TORR', 'Phone': '7033415000', 'USAddress': {'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA'...",2016-02-24 21:20:13Z,,,,,,1,,1736.0,266420.0,0,0.0,,,2011-11-09,,WY,2011-06-30,2010,2011-11-09T07:32:06-08:00,,1,,,,,1993,,0.0,,,13,1425,1437850,189785,,0,,222839,-39085.0,-36926,,1433342.0,261190,0,0,,34577.0,224264,0,,19,0.0,0.0,828.0,1398765.0,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,222550.0,0,265592.0,82955.0,71405,1455332,305505.0,17482,266420.0,0,0,0,1,1,1,0,1,1,1,1,0,1,1,,1,0,0.0,0,0,0,1,1,1,13,19,0,1,,,0.0,,0,0,1,,0,0,1,411648,,1180355,,,,,,,,,,,,,,265592.0,,0,0,265592.0,266420.0,,,,0.0,,,,0.0,7500.0,0.0,0.0,0.0,21600.0,0.0,,,17714.0,17714.0,,,59440.0,59440.0,,,5801.0,5801.0,,,,0.0,305505.0,0.0,29100.0,276405.0,,250.0,22261.0,2187206,904332,,1,,11349.0,,,0.0,7035.0,,,1433342.0,,,,-39085,,0,1,,,1,1.0,1,1,1,VA


#
#
#
# 12/4/2020 - Make it simpler by adding the variable type to the concordance file -- either date, string, float, or int
#
#
#
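
As a rough illustration of the idea in the note above, the sketch below drives the conversions from a type column in a concordance table. The concordance layout (`variable_name`, `var_type`) is an assumption made for this example, not the actual concordance file, and the toy values are made up.

```python
import pandas as pd

# Hypothetical concordance: one row per variable with its intended type
concordance = pd.DataFrame({
    'variable_name': ['F9_00_HD_TAX_PER_END', 'F9_00_HD_WEBSITE',
                      'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN'],
    'var_type': ['date', 'string', 'float', 'int'],
})

# Toy data standing in for df (made-up values)
toy = pd.DataFrame({
    'F9_00_HD_TAX_PER_END': ['2010-12-31', '2011-06-30'],
    'F9_00_HD_WEBSITE': ['N/A', 'www.example.org'],
    'F9_00_HD_GROSS_RCPT': ['1473903', '266420'],
    'F9_00_HD_GROUP_RETURN': ['0', '0'],
})

# Apply the conversion that matches each variable's declared type
for name, var_type in zip(concordance['variable_name'], concordance['var_type']):
    if var_type == 'date':
        toy[name] = pd.to_datetime(toy[name])
    elif var_type == 'string':
        toy[name] = toy[name].astype(str)
    else:  # 'float' or 'int'
        toy[name] = pd.to_numeric(toy[name]).astype(var_type)
print(toy.dtypes)
```

With the types stored alongside the variable names, the exclusion lists used in this notebook would no longer need to be maintained by hand.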

In [22]:
df[integer_cols].sample(5)

Unnamed: 0,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_FINCL_STMTS_AUDITED
815654,301128.0,0,167607.0,22,0,55820.0,696059.0,32543,1,495.0,0,22,0.0,0.0,19360.0,695564.0,221693.0,6100,104391.0,19945.0,241638.0,297458.0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,22,22,0,0,0,0,0,0,0
1669147,1680366.0,0,1665237.0,9,0,-192912.0,1209051.0,208605,79,9697.0,0,9,0.0,0.0,15129.0,1199354.0,746787.0,0,0.0,1126491.0,1873278.0,1680366.0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,9,9,0,1,0,0,0,0,0
1863810,4631491.0,0,0.0,4,0,804357.0,4631491.0,0,0,0.0,0,4,3827134.0,0.0,40791.0,4631491.0,0.0,0,4590700.0,0.0,3827134.0,4631491.0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,4,4,0,0,0,0,0,0,0
425541,1072114.0,0,46637.0,3,0,233160.0,519214.0,18608,9,2.0,136290,3,0.0,0.0,426254.0,519212.0,267033.0,0,403209.0,375907.0,642940.0,876100.0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,3,3,0,0,0,0,0,0,0
878570,244309.0,0,9285.0,5,0,-37740.0,91615.0,0,2,2331.0,0,5,0.0,92164.0,139.0,89284.0,151680.0,5289,227367.0,35976.0,279820.0,242080.0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,5,5,0,0,0,0,0,0,0


### Convert remaining string variables to float format

<br>Re-generate *string_cols*

In [24]:
string_cols = df.select_dtypes(include='object').columns.tolist()
print(len(string_cols))
print(string_cols, '\n')
#df[string_cols].describe().T

102
['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'fiscal_year', 'Filer', 'F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP

In [41]:
exclude_cols = ['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END',
                'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_SIGNING_OFFICER_SIGNTR',
                'F9_00_HD_TAX_YEAR', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_CTRY_OF_DOMICILE',
                'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_FILER_STATE_US', 'F9_00_HD_TYPE_ORG_OTHER_DESC',
                'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_00_HD_YEAR_FORMED',
                'Filer', 'fiscal_year', 'F9_00_HD_TIME_STAMP', 'F9_12_PC_ACCTG_METHOD_OTHER']
float_cols = [col for col in string_cols if col not in exclude_cols]
print(len(float_cols))
print(float_cols)

80
['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV_VOLUNTEERS', 'F9_01_PC_TOT_REVENUE_PRIOR', 'F9_01_PC_TOT_UBI_NET', 'F9_01_PZ_SALARIES_PRIOR', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_01_PZ_TOT_LIAB_BOY', 'F9_06_PC_ANNUAL_DISC_COVRD_PERS', 'F9_06_PC_CEO_COMPENSTN_PROCESS', 'F9_06_PC_

In [42]:
df[float_cols].sample(10)

Unnamed: 0,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_TRUST,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_NET,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_LIAB_BOY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED
878004,,,,,,,,,,,1.0,,,105306.0,,,601.0,67341,297904,,,405800.0,-41809,,448210,0.0,406401,,45000.0,99521,32180,,0,,1.0,,1,,0,,,,,,,,42600.0,,,,,,,,,,,,,,,390274,,,,390274,40993.0,40426.0,,1.0,,,,9763,,,1.0,1.0,,0.0
1252943,,,,1.0,,,,,,,1.0,,,0.0,4403078.0,727614.0,484266.0,29418445,2079536,-11629.0,0.0,443400.0,2511965,,2807150,11.0,5319115,0.0,0.0,29692381,273936,1.0,0,,1.0,,1,1.0,0,1.0,,,1.0,0.0,0.0,0.0,0.0,0.0,2571270.0,71843.0,,,101692.0,71843.0,116875.0,,,,,,153968.0,446552,,2643113.0,,446552,169259.0,169259.0,1.0,,,,,824355,2947558.0,1.0,,1.0,,0.0
810299,,,,,,399.0,,,,,1.0,,,,58.0,,808.0,271379,92572,58238.0,,64488.0,-4716,,128308,,123592,,35736.0,283567,12188,,0,,1.0,,1,,0,,,,,,,,67727.0,,2102.0,,245238.0,,,,,30486.0,37624.0,,331550.0,,,68752,,2102.0,10154.0,68752,647705.0,415162.0,,1.0,,,,37190,,,1.0,,,0.0
646180,,,,,,,0.0,,,,1.0,,,,,5300.0,611.0,572723,174237,2548.0,,262258.0,7799,,257618,85.0,265417,,78081.0,932835,360112,1.0,1,,1.0,,0,1.0,1,,,1.0,,0.0,0.0,,22075.0,,,,,,,,,,,,,,,270398,,0.0,0.0,270398,767047.0,32913.0,1.0,,,,,-7751,,,1.0,0.0,,0.0
999834,,,,,,,,,,1.0,,,,0.0,17700.0,0.0,5414.0,1902128,546031,0.0,0.0,1107608.0,247642,,883080,0.0,1130722,0.0,337049.0,1904781,2653,1.0,0,,1.0,,1,1.0,0,,,1.0,,1.0,0.0,0.0,191961.0,0.0,18500.0,,,,,,,,,,,,,1146731,,18500.0,,1146731,105478.0,75707.0,1.0,,,,,86130,1664.0,,1.0,1.0,,0.0
282252,,,,1.0,,,,,,,1.0,,,,933.0,,36.0,11480,102811,,,97548.0,-4294,,102811,,98517,,,682726,671246,,0,,,,1,,0,,,,1.0,,,,,,,,,,,1340.0,,,,,,,,98048,,1340.0,,98048,953365.0,339134.0,,1.0,,,,5071,,1.0,,,,
1393765,,,,,,,,,,,1.0,,,,30230.0,74697.0,705.0,379880,418964,715862.0,,240002.0,63371,,923428,1535.0,986799,24683.0,429767.0,990714,610834,1.0,1,,1.0,,0,1.0,1,,,,,,,,9047.0,,7423.0,,413632.0,,11431.0,,31343.0,21087.0,61706.0,,1042864.0,,,247966,14500.0,21923.0,5760.0,247966,1358614.0,722080.0,1.0,,,,,5972,,1.0,,1.0,,0.0
1136636,,,,1.0,,,0.0,,,,1.0,,,,132736.0,76506.0,61739.0,2525068,47299,2036.0,,,24818,,171693,27.0,196511,,47888.0,2526112,1044,1.0,0,1.0,1.0,,1,0.0,0,,1.0,,1.0,0.0,0.0,,36163.0,,387030.0,,,,125.0,,1590.0,,,,,,,0,,387030.0,0.0,0,9359.0,9359.0,1.0,,,,,9721,29544.0,1.0,,1.0,,0.0
179820,,,,1.0,,,,,,,1.0,,,,2122419.0,,,1144110,1992714,41681.0,,2440645.0,1176405,,3428340,,4604745,,1435626.0,1238610,94500,1.0,1,,1.0,,1,1.0,0,,,,,,3.0,,556575.0,,1475114.0,,,,,,,,,,,,,2919177,,1475114.0,50615.0,2919177,304009.0,55095.0,1.0,,,,,175680,,1.0,,0.0,,0.0
1434041,1.0,,,,,,,,,,,,1.0,1851187.0,0.0,0.0,205457.0,4551324,91545,8785.0,0.0,2195110.0,396030,,2013322,0.0,2409352,0.0,70590.0,4556474,5150,,0,,1.0,,1,,0,,,,,0.0,0.0,3380.0,0.0,72610.0,,,,,,,,,,,,,,2119994,,,,2119994,,,,1.0,,,,219100,245917.0,1.0,,1.0,,0.0


In [40]:
df['F9_12_PC_ACCTG_METHOD_CASH'].value_counts()[:10]

1    591946
Name: F9_12_PC_ACCTG_METHOD_CASH, dtype: int64

<br>Exclude additional columns based on the output above

In [47]:
exclude_cols = exclude_cols + ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC',
'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED']
print(exclude_cols)

['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED']


In [44]:
print(len(float_cols))
float_cols = [col for col in string_cols if col not in exclude_cols]
print(len(float_cols))
print(float_cols)

80
80
['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV_VOLUNTEERS', 'F9_01_PC_TOT_REVENUE_PRIOR', 'F9_01_PC_TOT_UBI_NET', 'F9_01_PZ_SALARIES_PRIOR', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_01_PZ_TOT_LIAB_BOY', 'F9_06_PC_ANNUAL_DISC_COVRD_PERS', 'F9_06_PC_CEO_COMPENSTN_PROCESS', 'F9_06_

# Ended here

In [49]:
new_variables_df[:1]

Unnamed: 0,variable_name_new,original_names,sub_keys,len,len_subkeys
0,F9_08_PC_TOTAL_REVENUE,"[TotalRevenueGrp, TotalRevenue]","[TotalRevenueColumnAmt, TotalRevenueColumn]",2,2


In [50]:
concordance[concordance['variable_name_new'].isin(exclude_cols)][['variable_name_new', 'sub_key', 'data_type_xsd']][:5]

Unnamed: 0,variable_name_new,sub_key,data_type_xsd
0,F9_00_HD_TAX_PER_END,,DateType
1,F9_00_HD_TAX_PER_END,,DateType
4,TaxPeriod,,
15,F9_00_HD_PRIN_OFF_NAME,,PersonNameType
16,F9_00_HD_PRIN_OFF_NAME,,PersonNameType


In [51]:
check_vars = concordance[concordance['variable_name_new'].isin(float_cols)][['variable_name_new', 
                                     'sub_key', 'data_type_xsd']].groupby('variable_name_new').first()
print(len(check_vars))
check_vars = check_vars.reset_index()
check_vars[:5]

83


Unnamed: 0,variable_name_new,sub_key,data_type_xsd
0,F9_00_HD_ADDR_CHANGE,,CheckboxType
1,F9_00_HD_AMENDED_RETURN,,CheckboxType
2,F9_00_HD_CTRY_OF_DOMICILE,,CountryType
3,F9_00_HD_EXEMPT_STATUS_4847A1,,CheckboxType
4,F9_00_HD_EXEMPT_STATUS_501C3,,CheckboxType


##### Save check_vars

In [52]:
check_vars.to_pickle('check_vars.pkl')

#### Run for *CheckboxType*

In [53]:
check_vars['data_type_xsd'].value_counts()

USAmountType      40
CheckboxType      21
BooleanType       11
USAmountNNType     4
CountType          2
TextType           1
IntegerNNType      1
YearType           1
CountryType        1
StringType         1
Name: data_type_xsd, dtype: int64

In [54]:
check_vars[check_vars['data_type_xsd']=='CheckboxType']['variable_name_new'].tolist()

['F9_00_HD_ADDR_CHANGE',
 'F9_00_HD_AMENDED_RETURN',
 'F9_00_HD_EXEMPT_STATUS_4847A1',
 'F9_00_HD_EXEMPT_STATUS_501C3',
 'F9_00_HD_FINAL_RETURN',
 'F9_00_HD_INITIAL_RETURN',
 'F9_00_HD_TYPE_ORG_ASSOCIATION',
 'F9_00_HD_TYPE_ORG_CORP',
 'F9_00_HD_TYPE_ORG_OTHER',
 'F9_00_HD_TYPE_ORG_TRUST',
 'F9_01_PC_TERMINATION_CONTRACTION',
 'F9_06_PC_FORM_AVAIL_OWN_WEBSITE',
 'F9_06_PC_FORM_UPON_REQUEST',
 'F9_06_PC_OTHER_WEBSITE',
 'F9_06_PC_OWN_WEBSITE',
 'F9_07_PC_NO_LISTED_PERS_COMPENSD',
 'F9_10_PC_ORG_FOLLOWS_SFAS117',
 'F9_10_PC_ORG_NOT_FOLLOW_SFAS117',
 'F9_12_PC_ACCTG_METHOD_ACCRUAL',
 'F9_12_PC_ACCTG_METHOD_CASH',
 'F9_12_PC_ACCTG_METHOD_OTHER']

<br>Inspect the descriptives below.
- Judging by the *unique* column, every variable has a single unique value, '1' (which is therefore also its *top* value), except for *F9_12_PC_ACCTG_METHOD_OTHER*
- So, all of these except *F9_12_PC_ACCTG_METHOD_OTHER* can be converted to 'float'

In [55]:
df[check_vars[check_vars['data_type_xsd']=='CheckboxType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_ADDR_CHANGE,67749,1,1,67749
F9_00_HD_AMENDED_RETURN,14996,1,1,14996
F9_00_HD_EXEMPT_STATUS_4847A1,1439,1,1,1439
F9_00_HD_EXEMPT_STATUS_501C3,1278859,1,1,1278859
F9_00_HD_FINAL_RETURN,9129,1,1,9129
F9_00_HD_INITIAL_RETURN,16214,1,1,16214
F9_00_HD_TYPE_ORG_ASSOCIATION,76367,1,1,76367
F9_00_HD_TYPE_ORG_CORP,1517294,1,1,1517294
F9_00_HD_TYPE_ORG_OTHER,42243,1,1,42243
F9_00_HD_TYPE_ORG_TRUST,55786,1,1,55786
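
The same check can be made without reading the `describe()` table by eye. The following is a minimal sketch on a toy frame, not the code used in this notebook: the frame `toy`, its values, and the helper `single_valued_columns` are all hypothetical. It flags columns whose non-null entries take more than one distinct value and then converts the remaining columns with `pd.to_numeric`, which is one possible way to do the conversion.

```python
import pandas as pd

def single_valued_columns(frame, cols):
    """Split cols into those with at most one distinct non-null value and the rest."""
    n_unique = frame[cols].nunique(dropna=True)
    ok = n_unique[n_unique <= 1].index.tolist()
    flagged = n_unique[n_unique > 1].index.tolist()
    return ok, flagged

# Toy data standing in for the checkbox columns (names and values are made up)
toy = pd.DataFrame({
    'CHECKBOX_A': ['1', None, '1'],
    'CHECKBOX_B': [None, '1', None],
    'CHECKBOX_OTHER': ['1', 'MODIFIED CASH', None],
})

ok_cols, flagged_cols = single_valued_columns(toy, toy.columns.tolist())
print(ok_cols)       # columns that look safe to convert
print(flagged_cols)  # columns that need a closer look

# Convert only the single-valued columns; missing entries become NaN
converted = toy[ok_cols].apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
```

On the real data, the flagged list would be the place to look for variables like *F9_12_PC_ACCTG_METHOD_OTHER* before adding them to `exclude_cols`.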


In [56]:
exclude_cols = exclude_cols + ['F9_12_PC_ACCTG_METHOD_OTHER']
print(exclude_cols)

['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER']


#### Run for *USAmountType*

In [57]:
df[check_vars[check_vars['data_type_xsd']=='USAmountType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_NET,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_LIAB_BOY,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN
4768,,797272,,,46893,756560,,,,-1929,799201,797272,,42641.0,48148,1255.0,,,,671128,,,,,,,,,,0.0,,0.0,0.0,30799.0,9781.0,,,,4328,
8046,0.0,19597753,594054.0,71778.0,29207227,11050321,276105.0,0.0,6130553.0,1329155,24747034,26076189,,13102659.0,38321755,9114528.0,308599.0,2528870.0,0.0,1147923,,,0.0,,0.0,3500722.0,3793600.0,19535919.0,,7883686.0,,183917.0,7883686.0,26825296.0,10004617.0,,,,4772851,
9725,,101516384,23938944.0,6958062.0,1209427321,77405377,3761169.0,,129007419.0,7986357,233256677,241243034,,131912356.0,1348823254,139395933.0,1101758.0,4600477.0,,58462093,347717.0,,232651.0,347717.0,100110.0,,,8727734.0,1471502.0,139535204.0,43147006.0,10683919.0,139535204.0,254387257.0,156355689.0,,,,26672006,-4993505.0
4228,,24120,,,850,24455,,,,-335,24455,24120,,,850,,,,,22120,,,,,,,,,,,,,,,,,,,450,
4761,0.0,308742,0.0,0.0,244729,90261,0.0,0.0,157623.0,244444,221921,466365,0.0,131660.0,252262,7533.0,0.0,0.0,0.0,187532,,,,,,,,,,87518.0,,,87518.0,11888.0,5944.0,,,,-53289,


In [59]:
df[check_vars[check_vars['data_type_xsd']=='USAmountType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_01_PC_BEN_PAID_MEMB_PRIOR,803660,62527,0,727311
F9_01_PC_CONTR_GRANTS_PRIOR,1489779,728074,0,191483
F9_01_PC_GRANTS_PRIOR,980072,252925,0,502900
F9_01_PC_INVEST_INCOME_PRIOR,1456806,299513,0,130308
F9_01_PC_NET_ASSETS_BOY,1705809,1236048,0,8086
F9_01_PC_OTHER_EXPENSE_PRIOR,1666525,896106,0,18796
F9_01_PC_OTHER_REV_PRIOR,1299060,355672,0,265671
F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,790628,37224,0,727734
F9_01_PC_PROG_SERVICE_REV_PRIOR,1367237,764528,0,216726
F9_01_PC_REV_LESS_EXP_PRIOR,1673463,694535,0,15169


#### Run for *BooleanType*

In [164]:
df[check_vars[check_vars['data_type_xsd']=='BooleanType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED
6002,False,1.0,1,0.0,1,1.0,1,0.0,0.0,0.0,0
9062,,,1,,1,,1,,1.0,,0
413,,,0,,1,,0,,,,0
737,,,0,,1,,0,,1.0,,0
2050,False,1.0,0,,0,0.0,0,,0.0,,0


In [60]:
df[check_vars[check_vars['data_type_xsd']=='BooleanType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_INCLUDES_SUBORD_ORGS,278169,58,false,265079
F9_06_PC_ANNUAL_DISC_COVRD_PERS,1210837,2,1,982082
F9_06_PC_CEO_COMPENSTN_PROCESS,1716794,2,0,954858
F9_06_PC_JOINT_VENTURE_POLICY,95449,2,0,74862
F9_06_PC_MINUTES_COMMITTEES,1720824,2,1,1468000
F9_06_PC_MONITORING_OF_COI_POLICY,1209677,2,1,928131
F9_06_PC_OTHER_COMPENSTN_PROCESS,1714798,2,0,1166997
F9_06_PC_POLICIES_GOVERN_CHAPTER,142560,2,1,82749
F9_12_PC_AUDIT_COMMITTEE,1034065,2,1,830808
F9_12_PC_FED_GRNT_AUDIT_PERFORMD,220872,2,1,168417


In [61]:
exclude_cols = exclude_cols + ['F9_00_HD_INCLUDES_SUBORD_ORGS']
print(len(exclude_cols))
print(exclude_cols)

15
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS']
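
*F9_00_HD_INCLUDES_SUBORD_ORGS* is excluded here because it has 58 unique values rather than two. If the column were to be recovered later, one option would be to map the recognisable tokens to 0/1 and set everything else aside for review. The sketch below only illustrates that idea: the Series `raw` and its values are invented (the actual 58 values are not shown in this notebook), and the mapping `bool_map` is an assumption.

```python
import pandas as pd

# Toy stand-in for a BooleanType column with inconsistent encodings (made-up values)
raw = pd.Series(['false', 'true', '0', '1', None, 'False', 'X'])

# Hypothetical mapping of recognised tokens to 0/1; anything else becomes NaN
bool_map = {'false': 0.0, '0': 0.0, 'true': 1.0, '1': 1.0}
normalised = raw.str.strip().str.lower().map(bool_map)

# Values that were present but not recognised, kept for manual review
unrecognised = raw[normalised.isna() & raw.notna()]
print(normalised.tolist())
print(unrecognised.tolist())
```

Keeping the unrecognised values separate avoids silently turning unexpected entries into missing data.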


#### Run for *USAmountNNType*

In [62]:
df[check_vars[check_vars['data_type_xsd']=='USAmountNNType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_TOTAL_CONTRIBUTIONS
110,,,,1002569
8750,,,,32590
9034,,,,180590
268,60864.0,570956.0,,97553
9152,,,,6983072


In [63]:
df[check_vars[check_vars['data_type_xsd']=='USAmountNNType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_08_PC_COST_OF_GOODS_SOLD,232584,123208,0,30569
F9_08_PC_GROSS_SALES_INVENTORY,245777,144129,0,22924
F9_08_PC_MEMBERSHIP_DUES,303045,137490,0,26324
F9_08_PC_TOTAL_CONTRIBUTIONS,1433987,750277,0,92186


#### Run for *CountType*

In [64]:
df[check_vars[check_vars['data_type_xsd']=='CountType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K
3766,,
1693,5.0,102.0
7957,0.0,15.0
6937,0.0,0.0
2030,0.0,0.0


In [65]:
df[check_vars[check_vars['data_type_xsd']=='CountType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_07_PC_NUM_CONTRCTRS_GRTR_100K,1087806,837,0,853103
F9_07_PC_NUM_INDS_GREATER_100K,1224114,1884,0,856881


#### Run for *IntegerNNType*

In [66]:
df[check_vars[check_vars['data_type_xsd']=='IntegerNNType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_01_PC_TOT_INDIV_VOLUNTEERS
1495,
2039,
3109,3.0
4161,20.0
8707,65.0


In [67]:
df[check_vars[check_vars['data_type_xsd']=='IntegerNNType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_01_PC_TOT_INDIV_VOLUNTEERS,1270620,9028,0,332728


#### Run for *YearType*

In [68]:
df[check_vars[check_vars['data_type_xsd']=='YearType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_00_HD_YEAR_FORMED
8075,2009
5659,1970
4226,1984
6288,1972
3379,1985


In [69]:
df[check_vars[check_vars['data_type_xsd']=='YearType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_YEAR_FORMED,1590812,320,2000,37797


#### Run for *CountryType*

In [70]:
df[check_vars[check_vars['data_type_xsd']=='CountryType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_00_HD_CTRY_OF_DOMICILE
600,
4279,
8704,
9642,
4160,


In [71]:
df[df['F9_00_HD_CTRY_OF_DOMICILE'].notnull()][['F9_00_HD_CTRY_OF_DOMICILE']].sample(2)

Unnamed: 0,F9_00_HD_CTRY_OF_DOMICILE
7595,SZ
5605,UK


In [72]:
df[check_vars[check_vars['data_type_xsd']=='CountryType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_CTRY_OF_DOMICILE,912,68,CA,212


In [73]:
exclude_cols = exclude_cols + ['F9_00_HD_CTRY_OF_DOMICILE']
print(len(exclude_cols))
print(exclude_cols)

16
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_CTRY_OF_DOMICILE']


#### Run for *StringType*

In [74]:
df[check_vars[check_vars['data_type_xsd']=='StringType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_00_HD_GROSS_EXEMPT_NUM
961,
1789,
8962,
9153,
3558,


In [75]:
df[df['F9_00_HD_GROSS_EXEMPT_NUM'].notnull()][['F9_00_HD_GROSS_EXEMPT_NUM']].sample(2)

Unnamed: 0,F9_00_HD_GROSS_EXEMPT_NUM
7247,928
9809,1732


In [76]:
df[check_vars[check_vars['data_type_xsd']=='StringType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_GROSS_EXEMPT_NUM,60429,1827,928,10822


In [77]:
exclude_cols = exclude_cols + ['F9_00_HD_GROSS_EXEMPT_NUM']
print(len(exclude_cols))
print(exclude_cols)

17
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_GROSS_EXEMPT_NUM']


#### Run for *TextType*

In [78]:
df[check_vars[check_vars['data_type_xsd']=='TextType']['variable_name_new'].tolist()].sample(5)

Unnamed: 0,F9_00_HD_SPECIAL_CONDITION_DESC
679,
2292,
4543,
6977,
7838,


In [79]:
df[check_vars[check_vars['data_type_xsd']=='TextType']['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_SPECIAL_CONDITION_DESC,682,383,PUBLIC DISCLOSURE COPY,26


In [80]:
df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][['F9_00_HD_SPECIAL_CONDITION_DESC']].sample(2)

Unnamed: 0,F9_00_HD_SPECIAL_CONDITION_DESC
8447,IR201750 HURRICANE IRMA LATE FILING RELIEF
4130,EXTENSION GRANTED TO 51516


In [81]:
exclude_cols = exclude_cols + ['F9_00_HD_SPECIAL_CONDITION_DESC']
print(len(exclude_cols))
print(exclude_cols)

18
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_SPECIAL_CONDITION_DESC']


In [82]:
print(len(float_cols))
float_cols = [col for col in string_cols if col not in exclude_cols]
print(len(float_cols))
print(float_cols)

83
78
['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV_VOLUNTEERS', 'F9_01_PC_TOT_REVENUE_PRIOR', 'F9_01_PC_TOT_UBI_NET', 'F9_01_PZ_SALARIES_PRIOR', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_01_PZ_TOT_LIAB_BOY', 'F9_06_PC_ANNUAL_DISC_COVRD_PERS', 'F9_06_PC_CEO_COMPENSTN_PROCESS', 'F9_06_PC_FORM_AVAIL_OWN_WEBSITE', 'F9_06_PC_FORM_UPON_REQUEST', 'F9_06_PC_JOINT

<br>Data Types for variables not in *float_cols*

In [83]:
set(check_vars[~check_vars['variable_name_new'].isin(float_cols)]['data_type_xsd'].tolist())

{'BooleanType', 'CheckboxType', 'CountryType', 'StringType', 'TextType'}

<br>Data Types for variables in *float_cols*

In [84]:
set(check_vars[check_vars['variable_name_new'].isin(float_cols)]['data_type_xsd'].tolist())

{'BooleanType',
 'CheckboxType',
 'CountType',
 'IntegerNNType',
 'USAmountNNType',
 'USAmountType',
 'YearType'}

<br>Intersection of Data Types for variables in and not in *float_cols*

In [85]:
# XSD types of variables inside and outside float_cols
types_in = set(check_vars[check_vars['variable_name_new'].isin(float_cols)]['data_type_xsd'].tolist())
types_out = set(check_vars[~check_vars['variable_name_new'].isin(float_cols)]['data_type_xsd'].tolist())
types_in.intersection(types_out)

{'BooleanType', 'CheckboxType'}

<br>Data Types that are only for variables in *float_cols*

In [86]:
# XSD types of variables inside and outside float_cols
types_in = set(check_vars[check_vars['variable_name_new'].isin(float_cols)]['data_type_xsd'].tolist())
types_out = set(check_vars[~check_vars['variable_name_new'].isin(float_cols)]['data_type_xsd'].tolist())
types_in - types_out

{'CountType', 'IntegerNNType', 'USAmountNNType', 'USAmountType', 'YearType'}
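
The set comparisons above can be collected into one small function. This is a sketch only: the helper `types_by_membership` and the table `toy_vars` are hypothetical, with the column names borrowed from `check_vars`.

```python
import pandas as pd

def types_by_membership(type_table, selected):
    """Return (types only among selected, types only among the rest, types in both)."""
    in_sel = type_table['variable_name_new'].isin(selected)
    sel_types = set(type_table.loc[in_sel, 'data_type_xsd'])
    rest_types = set(type_table.loc[~in_sel, 'data_type_xsd'])
    return sel_types - rest_types, rest_types - sel_types, sel_types & rest_types

# Toy table for illustration; variable names are placeholders
toy_vars = pd.DataFrame({
    'variable_name_new': ['A', 'B', 'C', 'D'],
    'data_type_xsd': ['USAmountType', 'CheckboxType', 'CheckboxType', 'TextType'],
})

only_sel, only_rest, both = types_by_membership(toy_vars, ['A', 'B'])
print(only_sel, only_rest, both)
```

Types that appear in `both` are the ones where the decision was made variable by variable rather than by type, as with *BooleanType* and *CheckboxType* above.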

In [87]:
print(len(check_vars))
float_types = ['CountType', 'IntegerNNType', 'USAmountNNType', 'USAmountType', 'YearType']
print(len(check_vars[check_vars['data_type_xsd'].isin(float_types)]))
check_vars[check_vars['data_type_xsd'].isin(float_types)]

83
48


Unnamed: 0,variable_name_new,sub_key,data_type_xsd
14,F9_00_HD_YEAR_FORMED,,YearType
15,F9_01_PC_BEN_PAID_MEMB_PRIOR,,USAmountType
16,F9_01_PC_CONTR_GRANTS_PRIOR,,USAmountType
17,F9_01_PC_GRANTS_PRIOR,,USAmountType
18,F9_01_PC_INVEST_INCOME_PRIOR,,USAmountType
19,F9_01_PC_NET_ASSETS_BOY,,USAmountType
20,F9_01_PC_OTHER_EXPENSE_PRIOR,,USAmountType
21,F9_01_PC_OTHER_REV_PRIOR,,USAmountType
22,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,,USAmountType
23,F9_01_PC_PROG_SERVICE_REV_PRIOR,,USAmountType


In [88]:
print(len(check_vars[check_vars['data_type_xsd'].isin(float_types)]))
print(len(check_vars[(check_vars['data_type_xsd'].isin(float_types)) & 
                    (~check_vars['variable_name_new'].isin(exclude_cols))]))

48
48


In [89]:
df[check_vars[check_vars['data_type_xsd'].isin(float_types)]['variable_name_new'].tolist()].sample(10)

Unnamed: 0,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_NET,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_LIAB_BOY,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN
1975,2003.0,,273952.0,25260.0,681,60719,91727.0,27466.0,,67532,47810,321821,,369631,,204834.0,102572,41853.0,0.0,1,,120272.0,,2712.0,,,,26972.0,,49253.0,,,276376.0,,,,59808.0,,279088.0,2580.0,59808.0,,,,,,24029,
2352,2002.0,0.0,6468.0,0.0,0,7088,32928.0,0.0,0.0,19452,-7008,32928,5.0,25920,0.0,0.0,373038,365950.0,0.0,0,51715.0,0.0,227640.0,,,,,,,,,,5144.0,,,,20776.0,,5144.0,,20776.0,378243.0,57979.0,,,,-6180,
5640,2014.0,0.0,414684.0,0.0,18,415242,9671.0,0.0,0.0,10211,415242,9671,8.0,424913,0.0,0.0,415262,20.0,0.0,0,0.0,0.0,0.0,827022.0,,,,,,,,,,,,,95005.0,,827022.0,,95005.0,,,,,,843211,
2293,,,,6000.0,445,336288,252751.0,,,284851,26545,258751,,285296,,,336288,,0.0,0,,,,,,,,,,,,,,,,,330876.0,,0.0,0.0,330876.0,62794.0,50760.0,,,,61233,
610,1957.0,,19303.0,37250.0,392,276476,93365.0,,,128785,17865,130615,,148480,0.0,,276476,,,0,0.0,0.0,0.0,8286.0,,,,,,,,,,,,,119991.0,,8286.0,,119991.0,853314.0,760625.0,,84.0,,619,
7185,2016.0,0.0,0.0,0.0,0,3000000,0.0,0.0,0.0,3000000,3000000,0,0.0,3000000,0.0,0.0,3000000,0.0,0.0,0,0.0,0.0,0.0,,,,,,,,,,,,,,12443895.0,,,,12443895.0,,,,,,798345,
6652,1992.0,0.0,177807.0,208034.0,95653,2516289,22943.0,-4925.0,0.0,0,37558,230977,13.0,268535,41721.0,0.0,2516289,0.0,0.0,0,35660.0,0.0,631476.0,999752.0,24500.0,,,9039.0,24500.0,1770.0,1349.0,2440.0,,,,571291.0,,,1024252.0,,,43673.0,40035.0,,,,1146632,362457.0
9311,1920.0,,,,261,6870,156109.0,102974.0,,360147,-43560,506942,0.0,463382,,350833.0,148378,141508.0,0.0,1,6698.0,125000.0,,,,,,,,,,,,,,,458557.0,,,38783.0,458557.0,53354.0,52060.0,,,,-5930,
5131,1999.0,0.0,2252640.0,0.0,18075,1450955,1431266.0,0.0,0.0,0,119635,2151080,150.0,2270715,0.0,719814.0,1695768,244813.0,0.0,1,11796.0,127940.0,0.0,873228.0,1173190.0,,,44130.0,1173190.0,44130.0,,,,,,34308.0,,,2046418.0,,,266972.0,182140.0,,,,247575,-48095.0
2948,2001.0,153492.0,25257.0,,0,75442,,68918.0,,42988,-16329,153492,60.0,137163,0.0,,75442,0.0,0.0,0,0.0,0.0,0.0,51981.0,0.0,,0.0,41221.0,0.0,98514.0,,,0.0,,0.0,0.0,68218.0,0.0,51981.0,0.0,68218.0,89916.0,31201.0,0.0,0.0,-11984.0,43759,0.0


In [90]:
df[check_vars[check_vars['data_type_xsd'].isin(float_types)]['variable_name_new'].tolist()].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_YEAR_FORMED,1590812,320,2000,37797
F9_01_PC_BEN_PAID_MEMB_PRIOR,803660,62527,0,727311
F9_01_PC_CONTR_GRANTS_PRIOR,1489779,728074,0,191483
F9_01_PC_GRANTS_PRIOR,980072,252925,0,502900
F9_01_PC_INVEST_INCOME_PRIOR,1456806,299513,0,130308
F9_01_PC_NET_ASSETS_BOY,1705809,1236048,0,8086
F9_01_PC_OTHER_EXPENSE_PRIOR,1666525,896106,0,18796
F9_01_PC_OTHER_REV_PRIOR,1299060,355672,0,265671
F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,790628,37224,0,727734
F9_01_PC_PROG_SERVICE_REV_PRIOR,1367237,764528,0,216726


In [91]:
print(len(check_vars))
check_vars[check_vars['variable_name_new'].isin(float_cols)][42:]

83


Unnamed: 0,variable_name_new,sub_key,data_type_xsd
46,F9_07_PC_NUM_INDS_GREATER_100K,,CountType
47,F9_07_PC_TOT_OTHER_COMPENSATION,,USAmountType
48,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,,USAmountType
49,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,,USAmountType
50,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,,USAmountType
51,F9_08_PC_CONTS_REPRTD_FNDRAISNG,,USAmountType
52,F9_08_PC_COST_OF_GOODS_SOLD,,USAmountNNType
53,F9_08_PC_FEDERATED_CAMPAIGNS,,USAmountType
54,F9_08_PC_FUNDRAISING_DIRECT_EXP,,USAmountType
55,F9_08_PC_FUNDRAISING_EVENTS,,USAmountType


#### Compare *check_vars*, *string_cols*, and *float_cols*

In [108]:
set(check_vars['variable_name_new'].tolist()) - set(string_cols)

set()

In [107]:
set(string_cols) - set(check_vars['variable_name_new'].tolist())

{'DLN',
 'EIN',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'OrganizationName',
 'TaxPeriod',
 'URL'}

In [106]:
set(float_cols) - set(check_vars['variable_name_new'].tolist())

set()

In [105]:
set(check_vars['variable_name_new'].tolist()) - set(float_cols)

{'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_INCLUDES_SUBORD_ORGS',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_12_PC_ACCTG_METHOD_OTHER'}

In [109]:
print(len(string_cols))
print(len(exclude_cols))
print(len(check_vars))
print(len(integer_cols))  # these have already been removed
print(len(float_cols))
float_cols = [col for col in string_cols if col not in exclude_cols]
print(len(float_cols))
print(float_cols)

96
18
83
46
78
78
['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV_VOLUNTEERS', 'F9_01_PC_TOT_REVENUE_PRIOR', 'F9_01_PC_TOT_UBI_NET', 'F9_01_PZ_SALARIES_PRIOR', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_01_PZ_TOT_LIAB_BOY', 'F9_06_PC_ANNUAL_DISC_COVRD_PERS', 'F9_06_PC_CEO_COMPENSTN_PROCESS', 'F9_06_PC_FORM_AVAIL_OWN_WEBSITE', 'F9_06_PC_FORM_UPON_REQUEST', 'F9
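
The counts printed above are consistent with `float_cols` and `exclude_cols` splitting `string_cols` into two non-overlapping groups (78 + 18 = 96). A minimal sketch of how that could be asserted rather than read off the printout follows; the helper `check_partition` and the toy lists are hypothetical.

```python
def check_partition(all_cols, part_a, part_b):
    """Raise AssertionError unless part_a and part_b split all_cols exactly."""
    a, b, everything = set(part_a), set(part_b), set(all_cols)
    assert not (a & b), f'columns in both lists: {sorted(a & b)}'
    assert a | b == everything, f'unaccounted columns: {sorted(everything ^ (a | b))}'
    return True

# Toy lists for illustration only
toy_string_cols = ['EIN', 'URL', 'AMOUNT_1', 'AMOUNT_2']
toy_exclude_cols = ['EIN', 'URL']
toy_float_cols = [c for c in toy_string_cols if c not in toy_exclude_cols]

print(check_partition(toy_string_cols, toy_float_cols, toy_exclude_cols))
```

The second assertion also catches an entry in the exclusion list that does not exist in the full column list, for example a misspelled variable name.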

#### Final verification of *float_cols*

In [110]:
df[float_cols].sample(10)

Unnamed: 0,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_INITIAL_RETURN,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_NET,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_LIAB_BOY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED
5952,,,,1.0,,,,,,,,,206019.0,,43.0,113523,189590,4108.0,,81070.0,-5088,,296328,50.0,291240,,106738.0,117545,4022.0,1.0,1,,1.0,,1,1.0,0,1.0,,,,0.0,0.0,,63480.0,,163617.0,,1524.0,,3420.0,,14350.0,,,,390.0,,47033.0,102832.0,,163617.0,0.0,102832.0,780.0,559.0,1.0,,,,,-24393,,,1.0,,,0
4737,,,,,,,,1.0,,,1979.0,1472384.0,1130390.0,17750.0,26087.0,2511727,415639,105678.0,97987.0,1114392.0,177617,,2198930,0.0,2376547,-17140.0,195170.0,2563668,51941.0,1.0,1,,1.0,,1,1.0,1,,,1.0,,1.0,0.0,0.0,103662.0,0.0,256421.0,,79411.0,,,,,,,,111262.0,880423.0,,1054323.0,,1136844.0,26357.0,1054323.0,716109.0,571902.0,1.0,,,,,122586,11169.0,1.0,,,,0
4230,,,,1.0,,,,1.0,,,2012.0,,1450117.0,336461.0,,681935,603945,2920.0,,,-36537,,1489574,40.0,1453037,,549168.0,688781,6846.0,1.0,1,,1.0,,1,1.0,1,,,,,,,38500.0,59000.0,,1103862.0,169897.0,,,28843.0,169897.0,8650.0,,,,,,,,,1273759.0,,,,,1.0,,,,,-212840,,1.0,,1.0,,0
7274,,,,1.0,,,,1.0,,,2007.0,0.0,967500.0,4170.0,74.0,80068,754894,0.0,0.0,359415.0,-890464,,2217453,0.0,1326989,0.0,1458389.0,434498,354430.0,1.0,0,,1.0,,1,1.0,0,,,,,0.0,0.0,0.0,44718.0,0.0,3846000.0,,,,,,,,,,,,,488376.0,,3846000.0,,488376.0,4091.0,4091.0,1.0,,,,,2117400,,1.0,,1.0,,0
2838,,,,1.0,,,,1.0,,,2004.0,,130876.0,,285.0,-1144378,342865,4950.0,,104697.0,-142243,,383051,2.0,240808,,40186.0,1801516,2945894.0,1.0,1,,1.0,,1,1.0,1,,,1.0,,,,47076.0,,683982.0,,,,,,,,,,130207.0,,,,96859.0,,130207.0,4737.0,96859.0,2862906.0,1361831.0,1.0,,,,,-170866,,1.0,,1.0,1.0,1
2221,,,,,,,,1.0,,,1979.0,,,26000.0,517.0,312995,76447,109917.0,,1100.0,9087,,102447,,111534,,,312995,,1.0,1,,1.0,,0,1.0,1,,,,1.0,,,,,,,,46255.0,,,,,,,,212613.0,975.0,,,,975.0,,,90367.0,90367.0,1.0,,,,,43022,,1.0,,1.0,,0
4735,,,,1.0,,,,1.0,,,2004.0,84267.0,519523.0,,273.0,106148,281351,,,,40333,,479463,5.0,519796,,113845.0,108771,2623.0,1.0,1,,1.0,0.0,1,1.0,1,,,0.0,,0.0,0.0,,100845.0,,429202.0,,,,,,,,,,,,,0.0,,429202.0,0.0,0.0,39082.0,25882.0,1.0,,,,,-36933,,,1.0,1.0,0.0,0
4175,,,,1.0,,,,1.0,,,2007.0,0.0,0.0,0.0,80.0,6214,187859,3061.0,0.0,772039.0,20934,,754246,0.0,775180,0.0,566387.0,33556,27342.0,,0,,1.0,,1,,0,,,,,0.0,0.0,0.0,31749.0,80437.0,,,,,,,,,,,,,,802370.0,,,4226.0,802370.0,7233.0,1611.0,,1.0,,,,-17753,,1.0,,1.0,,0
9208,,,,1.0,,,,1.0,,,1921.0,0.0,0.0,0.0,0.0,262457,0,0.0,0.0,0.0,0,,0,0.0,0,0.0,0.0,262457,0.0,,0,,1.0,,1,,0,,,,1.0,0.0,0.0,0.0,0.0,0.0,45491.0,,,,10199.0,,16658.0,,,50864.0,,,,,,96355.0,,,1434281.0,1054240.0,1.0,,,,,-9871,,,1.0,,,0
1386,,,,1.0,,,,1.0,,,,,27152.0,20722.0,1791.0,24179,5850,2400.0,,,4771,,26572,,31343,,,24179,,,0,,,,0,,0,,,,1.0,,,,,,15326.0,,,,,,,,,,,80.0,,,,15406.0,,,,,,1.0,,,,-10908,,,1.0,,,0


In [114]:
df[float_cols[:40]].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_ADDR_CHANGE,67749,1,1,67749
F9_00_HD_AMENDED_RETURN,14996,1,1,14996
F9_00_HD_EXEMPT_STATUS_4847A1,1439,1,1,1439
F9_00_HD_EXEMPT_STATUS_501C3,1278859,1,1,1278859
F9_00_HD_FINAL_RETURN,9129,1,1,9129
F9_00_HD_INITIAL_RETURN,16214,1,1,16214
F9_00_HD_TYPE_ORG_ASSOCIATION,76367,1,1,76367
F9_00_HD_TYPE_ORG_CORP,1517294,1,1,1517294
F9_00_HD_TYPE_ORG_OTHER,42243,1,1,42243
F9_00_HD_TYPE_ORG_TRUST,55786,1,1,55786


In [115]:
df[float_cols[40:]].describe().T

Unnamed: 0,count,unique,top,freq
F9_07_PC_NO_LISTED_PERS_COMPENSD,731409,1,1,731409
F9_07_PC_NUM_CONTRCTRS_GRTR_100K,1087806,837,0,853103
F9_07_PC_NUM_INDS_GREATER_100K,1224114,1884,0,856881
F9_07_PC_TOT_OTHER_COMPENSATION,1044996,203232,0,491845
F9_07_PC_TOT_REPRT_COMP_FROM_ORG,1288406,386948,0,393749
F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,947066,146753,0,738290
F9_08_PC_ALL_OTHER_CONTRIBUTIONS,1139009,547703,0,5786
F9_08_PC_CONTS_REPRTD_FNDRAISNG,285782,146941,0,23752
F9_08_PC_COST_OF_GOODS_SOLD,232584,123208,0,30569
F9_08_PC_FEDERATED_CAMPAIGNS,117182,59417,0,28546


#### Save lists and DF

In [112]:
import json
# Save each column list to its own JSON file so a later session can reload it
for name, cols in [('string_cols', string_cols), ('integer_cols', integer_cols),
                   ('exclude_cols', exclude_cols), ('float_cols', float_cols)]:
    with open(name + '.json', 'w') as fp:
        json.dump(cols, fp)

In [113]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables).pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  1.025451161666812
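
The same start/stop timing pattern recurs around every save in this notebook. As a rough sketch only, it could be wrapped in a small context manager; `timed` is a hypothetical helper name, not something the notebook defines, and the `sum(...)` call merely stands in for a real `df.to_pickle(...)` call.

```python
import timeit
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how many minutes the enclosed block took to run."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        elapsed = timeit.default_timer() - start
        print('%s - # of minutes: %.4f' % (label, elapsed / 60))

# Stand-in workload; in the notebook this would be a save such as df.to_pickle(...)
with timed('toy save'):
    total = sum(range(1000))
```

The `try`/`finally` means the elapsed time is still printed if the save raises an error.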


In [186]:
#df['F9_01_PZ_TOT_REV_CURR'] = df['F9_01_PZ_TOT_REV_CURR'].astype('float')

#### Convert *float_cols* to float

In [116]:
for c in float_cols:
    df[c] = df[c].astype('float')

#### Save DF

In [119]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables).pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  0.5845796366666036


# Descriptives for Numeric and String Variables

In [120]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,399676.0,1.895023e+04,2.353346e+05,-1.541000e+04,0.00,0.0,0.00,2.642141e+07
F9_00_HD_TAX_YEAR,1727056.0,2.014072e+03,2.405912e+00,2.009000e+03,2012.00,2014.0,2016.00,2.018000e+03
F9_00_HD_ADDR_CHANGE,67749.0,1.000000e+00,0.000000e+00,1.000000e+00,1.00,1.0,1.00,1.000000e+00
F9_00_HD_AMENDED_RETURN,14996.0,1.000000e+00,0.000000e+00,1.000000e+00,1.00,1.0,1.00,1.000000e+00
F9_00_HD_EXEMPT_STATUS_4847A1,1439.0,1.000000e+00,0.000000e+00,1.000000e+00,1.00,1.0,1.00,1.000000e+00
F9_00_HD_EXEMPT_STATUS_501C3,1278859.0,1.000000e+00,0.000000e+00,1.000000e+00,1.00,1.0,1.00,1.000000e+00
F9_00_HD_FINAL_RETURN,9129.0,1.000000e+00,0.000000e+00,1.000000e+00,1.00,1.0,1.00,1.000000e+00
F9_00_HD_GROSS_RCPT,1727056.0,1.595972e+07,5.185864e+08,0.000000e+00,235028.00,580695.5,2295982.50,3.105170e+11
F9_00_HD_GROUP_RETURN,1727056.0,2.448097e-03,4.941766e-02,0.000000e+00,0.00,0.0,0.00,1.000000e+00
F9_00_HD_INITIAL_RETURN,16214.0,1.000000e+00,0.000000e+00,1.000000e+00,1.00,1.0,1.00,1.000000e+00


In [122]:
string_cols = df.select_dtypes(include='object').columns.tolist()
print(len(string_cols))
print(string_cols, '\n')
df[string_cols].describe().T

18
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER'] 



Unnamed: 0,count,unique,top,freq
DLN,1727056,1727056,93493281007265,1
EIN,1727056,324343,943041314,18
OrganizationName,1727056,384946,SHRINERS INTERNATIONAL,433
TaxPeriod,1727056,118,201712,137886
URL,1727056,1727056,https://s3.amazonaws.com/irs-form-990/201201039349300405_public.xml,1
F9_00_HD_TAX_PER_END,1727056,724,2017-12-31,137859
F9_00_HD_CTRY_OF_DOMICILE,912,68,CA,212
F9_00_HD_EXEMPT_STATUS_501C,446758,46,"{'@organization501cTypeTxt': '6', '#text': 'X'}",87293
F9_00_HD_GROSS_EXEMPT_NUM,60429,1827,0928,10822
F9_00_HD_INCLUDES_SUBORD_ORGS,278169,58,false,265079


In [125]:
set(string_cols) - set(exclude_cols)

set()

In [124]:
set(exclude_cols) - set(string_cols)

set()

#### Save DF

In [123]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables).pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  0.5779472966668739


In [126]:
import timeit
start_time = timeit.default_timer()
df.to_csv('all filings - with 185 newly named control variables (with parsed sub-key variables).csv', index=False)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  5.5354063383335115


<br>Export to Stata without the 3 problem columns plus two other columns that Stata cannot store (e.g., some values are lists)

In [135]:
print(problem_cols)

['F9_00_HD_EXEMPT_STATUS_501C', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS']


In [142]:
exclude = problem_cols + ['F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_06_PC_STATES_WHERE_RET_FILED']

import timeit
start_time = timeit.default_timer()
df[[c for c in df.columns.tolist() if c not in exclude]].to_stata('all filings - with 185 newly named control variables (with parsed sub-key variables).dta', version=117)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  2.5130808066666455


In [141]:
# Some of these values are lists, which Stata cannot store
df['F9_06_PC_STATES_WHERE_RET_FILED'].value_counts().head()

CA    119601
NY     91240
PA     46096
MA     44149
OH     40398
Name: F9_06_PC_STATES_WHERE_RET_FILED, dtype: int64
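
Columns like this one could also be located programmatically before exporting. The sketch below is only an illustration under stated assumptions: `stata_unfriendly_cols` is a hypothetical helper, the toy frame stands in for the real filings data, and the 32-character limit is the Stata variable-name length noted at the top of this notebook.

```python
import pandas as pd

def stata_unfriendly_cols(frame, max_name_len=32):
    """Return (columns holding list/dict values, column names longer than max_name_len)."""
    container_cols = [
        col for col in frame.select_dtypes(include='object').columns
        if frame[col].map(lambda v: isinstance(v, (list, dict))).any()
    ]
    long_names = [col for col in frame.columns if len(col) > max_name_len]
    return container_cols, long_names

# Toy data standing in for the real filings DataFrame (hypothetical column names)
toy = pd.DataFrame({
    'STATES_FILED': ['CA', ['PA', 'NJ', 'DE'], None],
    'A_VARIABLE_NAME_THAT_IS_LONGER_THAN_32_CHARS': [1.0, 2.0, 3.0],
})
container_cols, long_names = stata_unfriendly_cols(toy)
print(container_cols)
print(long_names)
```

The first list gives candidates to exclude from (or flatten before) `to_stata`; the second gives names that Stata would truncate.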

In [137]:
df['F9_00_HD_SPECIAL_CONDITION_DESC'].value_counts().head()

PUBLIC DISCLOSURE COPY         26
HURRICANE IRMA                 18
HURRICANE SANDY                12
EXTENSION GRANTED TO 111519    11
EXTENDED TO 8172015            11
Name: F9_00_HD_SPECIAL_CONDITION_DESC, dtype: int64

In [132]:
for p in problem_cols:
    print(df[p].value_counts().head(), '\n\n')

{'@organization501cTypeTxt': '6', '#text': 'X'}    87293
{'@organization501cTypeTxt': '4', '#text': 'X'}    44596
{'@organization501cTypeTxt': '5', '#text': 'X'}    38180
{'@organization501cTypeTxt': '7', '#text': 'X'}    35328
{'@typeOf501cOrganization': '6', '#text': 'X'}     32959
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64 


{'@methodOfAccountingOtherDesc': 'MODIFIED CASH', '#text': 'X'}          10407
X                                                                         3862
{'@note': 'MODIFIED CASH', '#text': 'X'}                                  3117
{'@methodOfAccountingOtherDesc': 'Modified Cash', '#text': 'X'}           2578
{'@methodOfAccountingOtherDesc': 'MODIFIED CASH BASIS', '#text': 'X'}     1469
Name: F9_12_PC_ACCTG_METHOD_OTHER, dtype: int64 


false                                                         265079
1                                                               6696
true                                                            4816
{'@referenc
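
The dictionary-style cells shown above pair an `@...` attribute with a `#text` marker. A minimal sketch of how such cells might be split into two plain values follows. `split_attribute_value` is a hypothetical helper, and the sketch assumes each cell is a real dict, a dict-like string, or a plain string; it is not necessarily how these columns are resolved in this notebook.

```python
import ast

def split_attribute_value(value):
    """Return (attribute_text, marker_text) for a dict, dict-like string, or plain value."""
    if isinstance(value, str) and value.strip().startswith('{'):
        try:
            value = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return None, value  # leave unparseable text untouched
    if isinstance(value, dict):
        # '#text' holds the checkbox marker; the first '@...' key holds the attribute
        attrs = [v for k, v in value.items() if k.startswith('@')]
        return (attrs[0] if attrs else None), value.get('#text')
    return None, value

examples = [
    {'@organization501cTypeTxt': '6', '#text': 'X'},
    "{'@typeOf501cOrganization': '4', '#text': 'X'}",
    'X',
]
parsed = [split_attribute_value(v) for v in examples]
print(parsed)
```

Because the attribute key varies across schema versions (`@organization501cTypeTxt`, `@typeOf501cOrganization`, `@note`), the sketch matches any key starting with `@` rather than a fixed name.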

# Create version with null values filled

In [7]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [8]:
missingcols = list(df.columns[df.isnull().any()])
print(len(missingcols))
print(missingcols)

134
['F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV

<br>Descriptives for columns with no missing data

In [145]:
df[[c for c in df.columns.tolist() if c not in missingcols]].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_00_HD_TAX_YEAR,1727056,2014,2,2009,2014,2018
F9_00_HD_GROSS_RCPT,1727056,15959719,518586418,0,580696,310516974055
F9_00_HD_GROUP_RETURN,1727056,0,0,0,0,1
F9_01_PC_CONTR_GRANTS_CURR,1727056,1819022,26362481,-654611,112956,9265119609
F9_01_PC_INDEP_VOTING_MEMB,1727056,19,761,0,9,830201
F9_01_PC_PROF_FUNDRISING_EXP_CURR,1727056,4403,113729,-15410,0,26421406
F9_01_PC_REV_LESS_EXP_CURR,1727056,626690,41704576,-1839749722,10714,50245931778
F9_01_PC_TOT_ASSETS_EOY,1727056,22387952,391641946,-98344486,795666,90967341073
F9_01_PC_TOT_FNDR_EXP_CURR,1727056,90257,1257584,-16893234,0,233149738
F9_01_PC_TOT_INDIV_EMPLOYED,1727056,93,1792,0,3,908433


<br>Descriptives for columns with missing data

In [146]:
df[missingcols].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,399676,18950,235335,-15410,0,26421406
F9_00_HD_ADDR_CHANGE,67749,1,0,1,1,1
F9_00_HD_AMENDED_RETURN,14996,1,0,1,1,1
F9_00_HD_EXEMPT_STATUS_4847A1,1439,1,0,1,1,1
F9_00_HD_EXEMPT_STATUS_501C3,1278859,1,0,1,1,1
F9_00_HD_FINAL_RETURN,9129,1,0,1,1,1
F9_00_HD_INITIAL_RETURN,16214,1,0,1,1,1
F9_00_HD_TYPE_ORG_ASSOCIATION,76367,1,0,1,1,1
F9_00_HD_TYPE_ORG_CORP,1517294,1,0,1,1,1
F9_00_HD_TYPE_ORG_OTHER,42243,1,0,1,1,1


In [200]:
#df.describe(percentiles=[]).T.to_excel('descriptives - perks (pre-filling).xlsx')

In [None]:
#descriptives = pd.DataFrame(df.describe().T)

##### Descriptives (String)
Show descriptives for a limited number of string variables

In [9]:
print(len(list(df.select_dtypes(include=['object']).columns)))
print(list(df.select_dtypes(include=['object']).columns))

18
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER']


In [None]:
#df[['BMF_NTMAJ5', 'BMF_FILER', 'BMF_ZFILER', 'F9_00_HD_STATE_OF_DOMICILE',]].describe(include=['O']).T

##### Select numerical columns

In [None]:
#print(len(list(df.select_dtypes(include=['float', 'int']).columns)))
#print(list(df.select_dtypes(include=['float', 'int']).columns))

### Check which of *missingcols* should be 'filled'

In [47]:
object_cols = list(df.select_dtypes(include=['object']).columns)
print(len(object_cols))
print(object_cols)

18
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER']


In [11]:
print(len(missingcols))
print(len(set(object_cols) - set(missingcols)))
print(set(object_cols) - set(missingcols))

134
7
{'OrganizationName', 'DLN', 'URL', 'TaxPeriod', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'EIN', 'F9_00_HD_TAX_PER_END'}


In [12]:
print(len(set(missingcols).intersection(set(object_cols))))
set(missingcols).intersection(set(object_cols))

11


{'F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_INCLUDES_SUBORD_ORGS',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_12_PC_ACCTG_METHOD_OTHER'}

In [14]:
#import json
#f = open('exclude_cols.json', 'r')
#exclude_cols = json.load(f)
#exclude_cols = [str(t) for t in exclude_cols]
#print(len(exclude_cols))
#print(exclude_cols)

18
['DLN', 'EIN', 'OrganizationName', 'TaxPeriod', 'URL', 'F9_00_HD_TAX_PER_END', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_SPECIAL_CONDITION_DESC']


In [15]:
print(len(exclude_cols))
set(exclude_cols) - set(['F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_INCLUDES_SUBORD_ORGS',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_12_PC_ACCTG_METHOD_OTHER'])

18


{'DLN',
 'EIN',
 'F9_00_HD_TAX_PER_END',
 'F9_01_PZ_ORGANIZATIONAL_MISSION',
 'OrganizationName',
 'TaxPeriod',
 'URL'}

In [16]:
set(['F9_00_HD_CTRY_OF_DOMICILE',
 'F9_00_HD_EXEMPT_STATUS_501C',
 'F9_00_HD_GROSS_EXEMPT_NUM',
 'F9_00_HD_INCLUDES_SUBORD_ORGS',
 'F9_00_HD_PRIN_OFF_NAME',
 'F9_00_HD_SPECIAL_CONDITION_DESC',
 'F9_00_HD_STATE_OF_DOMICILE',
 'F9_00_HD_TYPE_ORG_OTHER_DESC',
 'F9_00_HD_WEBSITE',
 'F9_06_PC_STATES_WHERE_RET_FILED',
 'F9_12_PC_ACCTG_METHOD_OTHER']) - set(exclude_cols)

set()

In [17]:
set(exclude_cols) - set(object_cols)

set()

In [18]:
set(object_cols) - set(exclude_cols)

set()

In [19]:
print(len(set(missingcols) - set(object_cols)))
print(set(missingcols) - set(object_cols))

123
{'F9_07_PC_NUM_INDS_GREATER_100K', 'F9_01_PC_GRANTS_PRIOR', 'F9_09_PC_COMP_OFFICERS_FUNDRAISE', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_00_HD_ADDR_CHANGE', 'F9_08_PC_FUNDRAISING_DIRECT_EXP', 'F9_09_PC_FEES_FOR_SVCE_INVST_TOT', 'F9_10_PC_RET_EARNINGS_ENDWMT_EOY', 'F9_08_PC_CONTS_REPRTD_FNDRAISNG', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_INITIAL_RETURN', 'F9_12_PC_ACCTG_METHOD_CASH', 'F9_09_PC_FEES_FOR_SVCE_OTH_TOT', 'F9_08_PC_NONCASH_CONTRIBUTIONS', 'F9_09_PC_FEES_FOR_SVCE_ACCT_TOT', 'F9_08_PC_GROSS_SALES_INVENTORY', 'F9_09_PC_TOTAL_FUNDRAISE_EXPENSE', 'F9_09_PC_COMP_DISQUAL_PROG_SVCE', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_09_PC_PAYROLL_TAX_PROG_SVCE', 'F9_10_PC_CASH_NON_INTEREST_BOY', 'F9_10_PC_LOANS_FROM_OFFICERS_EOY', 'F9_08_PC_FEDERATED_CAMPAIGNS', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_10_PC_BOND_LIABILITIES_EOY', 'F9_09_PC_PENSION_CONT_FUNDRAISE', 'F9_09_PC_OTHER_EMP_BEN_PROG_SVCE', 'F9_07_PC_NUM_CONTRCTRS_GRTR_100K', 'F9_11_PC_RECNCLTN_DONATED_SVCES', 'F9_12_PC_FED_GRNT_AUDIT_PERFORMD', 'F9_

In [20]:
fill_cols = [col for col in missingcols if col not in exclude_cols]
print(len(fill_cols))
print(fill_cols)

123
['F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_INDIV_VOLUNTEERS', 'F9_01_PC_TOT_REVENUE_PRIOR', 'F9_01_PC_TOT_UBI_NET', 'F9_01_PZ_SALARIES_PRIOR', 'F9_01_PZ_TOT_ASSETS_BOY', 'F9_01_PZ_TOT_LIAB_BOY', 'F9_06_PC_ANNUAL_DISC_COVRD_PERS', 'F9_06_PC_CEO_COMPENSTN_PROCESS', 'F9_06_PC_FORM_AVAIL_OWN_WEBSITE', 'F9_06_PC_FORM

### Write function to fill null values with zeros

In [21]:
def fillnull(var):
    """Replace null values in df[var] with 0 (modifies the global df in place)."""
    print('# of missing observations in %s before processing:' % var, df[var].isnull().sum())

    # Keep existing values; substitute 0 wherever the value is null
    df[var] = np.where(df[var].isnull(), 0, df[var])

    print('# of missing observations in %s after processing:' % var, df[var].isnull().sum(), '\n')
    # Return a small random sample for spot-checking
    return df.sample(5)[['URL', var]]

In [22]:
for c in fill_cols[:]:
    fillnull(c)

# of missing observations in F9_09_PC_FEES_FOR_SVCE_FR_TOT before processing: 1327380
# of missing observations in F9_09_PC_FEES_FOR_SVCE_FR_TOT after processing: 0 

# of missing observations in F9_00_HD_ADDR_CHANGE before processing: 1659307
# of missing observations in F9_00_HD_ADDR_CHANGE after processing: 0 

# of missing observations in F9_00_HD_AMENDED_RETURN before processing: 1712060
# of missing observations in F9_00_HD_AMENDED_RETURN after processing: 0 

# of missing observations in F9_00_HD_EXEMPT_STATUS_4847A1 before processing: 1725617
# of missing observations in F9_00_HD_EXEMPT_STATUS_4847A1 after processing: 0 

# of missing observations in F9_00_HD_EXEMPT_STATUS_501C3 before processing: 448197
# of missing observations in F9_00_HD_EXEMPT_STATUS_501C3 after processing: 0 

# of missing observations in F9_00_HD_FINAL_RETURN before processing: 1717927
# of missing observations in F9_00_HD_FINAL_RETURN after processing: 0 

# of missing observations in F9_00_HD_INITIAL_R

# of missing observations in F9_08_PC_FUNDRAISING_DIRECT_EXP before processing: 1258141
# of missing observations in F9_08_PC_FUNDRAISING_DIRECT_EXP after processing: 0 

# of missing observations in F9_08_PC_FUNDRAISING_EVENTS before processing: 1423260
# of missing observations in F9_08_PC_FUNDRAISING_EVENTS after processing: 0 

# of missing observations in F9_08_PC_FUNDRAISING_GROSS_INC before processing: 1239441
# of missing observations in F9_08_PC_FUNDRAISING_GROSS_INC after processing: 0 

# of missing observations in F9_08_PC_GAMING_DIRECT_EXPENSES before processing: 1634833
# of missing observations in F9_08_PC_GAMING_DIRECT_EXPENSES after processing: 0 

# of missing observations in F9_08_PC_GAMING_GROSS_INCOME before processing: 1629985
# of missing observations in F9_08_PC_GAMING_GROSS_INCOME after processing: 0 

# of missing observations in F9_08_PC_GOVERNMENT_GRANTS before processing: 1245328
# of missing observations in F9_08_PC_GOVERNMENT_GRANTS after processing: 0 



# of missing observations in F9_10_PC_CASH_NON_INTEREST_EOY before processing: 161440
# of missing observations in F9_10_PC_CASH_NON_INTEREST_EOY after processing: 0 

# of missing observations in F9_10_PC_LAND_BLDG_EQPMT before processing: 442800
# of missing observations in F9_10_PC_LAND_BLDG_EQPMT after processing: 0 

# of missing observations in F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN before processing: 495892
# of missing observations in F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN after processing: 0 

# of missing observations in F9_10_PC_LOANS_FROM_OFFICERS_EOY before processing: 1638290
# of missing observations in F9_10_PC_LOANS_FROM_OFFICERS_EOY after processing: 0 

# of missing observations in F9_10_PC_ORG_FOLLOWS_SFAS117 before processing: 406153
# of missing observations in F9_10_PC_ORG_FOLLOWS_SFAS117 after processing: 0 

# of missing observations in F9_10_PC_ORG_NOT_FOLLOW_SFAS117 before processing: 1327645
# of missing observations in F9_10_PC_ORG_NOT_FOLLOW_SFAS117 after processing: 
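
The per-column loop above could also be written as one vectorized call. The sketch below is only an illustration on a toy frame with hypothetical column names; `toy_fill_cols` plays the role of `fill_cols`, and the text column is deliberately left out so that its nulls are preserved.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the filings data (hypothetical column names)
toy = pd.DataFrame({
    'URL': ['a', 'b', 'c'],
    'ADDR_CHANGE': [1.0, np.nan, np.nan],
    'FEES_TOTAL': [np.nan, 250.0, 0.0],
    'WEBSITE': ['x.org', None, 'y.org'],
})
toy_fill_cols = ['ADDR_CHANGE', 'FEES_TOTAL']  # numeric columns only

# Count nulls per column, fill them with 0, then count again
missing_before = toy[toy_fill_cols].isnull().sum()
toy[toy_fill_cols] = toy[toy_fill_cols].fillna(0)
missing_after = toy[toy_fill_cols].isnull().sum()

print(pd.DataFrame({'before': missing_before, 'after': missing_after}))
```

This gives the same before/after counts as the printed log, collected in a single table.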

In [23]:
df[fill_cols].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,1727056,4385,113492,-15410,0,26421406
F9_00_HD_ADDR_CHANGE,1727056,0,0,0,0,1
F9_00_HD_AMENDED_RETURN,1727056,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_4847A1,1727056,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_501C3,1727056,1,0,0,1,1
F9_00_HD_FINAL_RETURN,1727056,0,0,0,0,1
F9_00_HD_INITIAL_RETURN,1727056,0,0,0,0,1
F9_00_HD_TYPE_ORG_ASSOCIATION,1727056,0,0,0,0,1
F9_00_HD_TYPE_ORG_CORP,1727056,1,0,0,1,1
F9_00_HD_TYPE_ORG_OTHER,1727056,0,0,0,0,1


#### Save DF

In [6]:
#import timeit
#start_time = timeit.default_timer()
#df = pd.read_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables).pkl')
#print('# of columns:', len(df.columns))
#print('# of observations:', len(df))
#elapsed = timeit.default_timer() - start_time
#print('# of minutes: ', elapsed/60) 
#df[:2]

# of columns: 190
# of observations: 1727056
# of minutes:  0.3791399100000002


Unnamed: 0,DLN,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,OrganizationName,TaxPeriod,URL,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,F9_10_PC_UNSECURED_NOTES_BOY
0,93493313013011,232705170,,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,201012,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,2010-12-31,2010,1.0,,,,,1.0,,,1473903.0,0,,,MICHAEL ANTON,,PA,,1.0,,,,,1992.0,0.0,1439340.0,1044925.0,638637.0,10,30447.0,1753405.0,243131.0,0.0,0,0.0,0.0,89152.0,193604.0,,2440859.0,881768.0,195892,0,0.0,450430.0,1075372.0,0,0.0,10,0.0,925000.0,33563.0,1990429.0,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751.0,1000,0.0,0.0,0.0,1925215.0,1384751.0,171810.0,1473903.0,1,1.0,0.0,0,1,0,0,0,0,0,0,,1.0,0,,0,0,0,1.0,1,1.0,10,10,0,0.0,,,,"[PA, NJ, DE]",0,0,0,1.0,0.0,0.0,0,0.0,0.0,0.0,1439340.0,,,,,,,,,,,,,,,1439340.0,1000.0,,1473903.0,,,,,,,,,21675.0,,215.0,,,,,,,,,,,,,,,,,,,,1384751.0,195892.0,145115.0,1043744.0,,,,256845.0,86228.0,,1.0,,240077.0,,332660.0,270700.0,,,2440859.0,,,,89152.0,,0,1.0,,,1.0,,0.0,1,
1,93493313013111,581805618,0.0,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,201106,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,2011-06-30,2010,,,,,,1.0,,1736.0,266420.0,0,False,,,,WY,,1.0,,,,,1993.0,,0.0,,,13,1425.0,1437850.0,189785.0,,0,,222839.0,-39085.0,-36926.0,,1433342.0,261190.0,0,0,,34577.0,224264.0,0,,19,0.0,0.0,828.0,1398765.0,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,222550.0,0,265592.0,82955.0,71405.0,1455332.0,305505.0,17482.0,266420.0,1,1.0,1.0,0,1,1,1,1,0,1,1,,1.0,0,0.0,0,0,0,1.0,1,1.0,13,19,0,1.0,,,0.0,,0,0,1,,0.0,0.0,1,411648.0,,1180355.0,,,,,,,,,,,,,,265592.0,,0.0,0.0,265592.0,266420.0,,,,0.0,,,,0.0,7500.0,0.0,0.0,0.0,21600.0,0.0,,,17714.0,17714.0,,,59440.0,59440.0,,,5801.0,5801.0,,,,0.0,305505.0,0.0,29100.0,276405.0,,250.0,22261.0,2187206.0,904332.0,,1.0,,11349.0,,,0.0,7035.0,,1433342.0,,,,-39085.0,,0,1.0,,,1.0,1.0,1.0,1,


In [24]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  0.6010813583333345


#### View and save descriptives for all variables

In [25]:
df.describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,1727056,4385,113492,-15410,0,26421406
F9_00_HD_TAX_YEAR,1727056,2014,2,2009,2014,2018
F9_00_HD_ADDR_CHANGE,1727056,0,0,0,0,1
F9_00_HD_AMENDED_RETURN,1727056,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_4847A1,1727056,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_501C3,1727056,1,0,0,1,1
F9_00_HD_FINAL_RETURN,1727056,0,0,0,0,1
F9_00_HD_GROSS_RCPT,1727056,15959719,518586418,0,580696,310516974055
F9_00_HD_GROUP_RETURN,1727056,0,0,0,0,1
F9_00_HD_INITIAL_RETURN,1727056,0,0,0,0,1


### Convert variables to integer

In [43]:
df['F9_00_HD_AMENDED_RETURN'].dropna().value_counts().index.isin([0,1]).all()

True

In [51]:
float_cols = list(df.select_dtypes(include=['float']).columns)
print(len(float_cols))
print(float_cols[:5])

140
['F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C3']


In [53]:
# Treat a float column as a 0/1 indicator if every non-null value is 0 or 1
bool_cols = [col for col in float_cols if
               df[col].dropna().value_counts().index.isin([0,1]).all()]
bool_cols

['F9_00_HD_ADDR_CHANGE',
 'F9_00_HD_AMENDED_RETURN',
 'F9_00_HD_EXEMPT_STATUS_4847A1',
 'F9_00_HD_EXEMPT_STATUS_501C3',
 'F9_00_HD_FINAL_RETURN',
 'F9_00_HD_INITIAL_RETURN',
 'F9_00_HD_TYPE_ORG_ASSOCIATION',
 'F9_00_HD_TYPE_ORG_CORP',
 'F9_00_HD_TYPE_ORG_OTHER',
 'F9_00_HD_TYPE_ORG_TRUST',
 'F9_01_PC_TERMINATION_CONTRACTION',
 'F9_06_PC_ANNUAL_DISC_COVRD_PERS',
 'F9_06_PC_CEO_COMPENSTN_PROCESS',
 'F9_06_PC_FORM_AVAIL_OWN_WEBSITE',
 'F9_06_PC_FORM_UPON_REQUEST',
 'F9_06_PC_JOINT_VENTURE_POLICY',
 'F9_06_PC_MINUTES_COMMITTEES',
 'F9_06_PC_MONITORING_OF_COI_POLICY',
 'F9_06_PC_OTHER_COMPENSTN_PROCESS',
 'F9_06_PC_OTHER_WEBSITE',
 'F9_06_PC_OWN_WEBSITE',
 'F9_06_PC_POLICIES_GOVERN_CHAPTER',
 'F9_07_PC_NO_LISTED_PERS_COMPENSD',
 'F9_10_PC_ORG_FOLLOWS_SFAS117',
 'F9_10_PC_ORG_NOT_FOLLOW_SFAS117',
 'F9_12_PC_ACCTG_METHOD_ACCRUAL',
 'F9_12_PC_ACCTG_METHOD_CASH',
 'F9_12_PC_AUDIT_COMMITTEE',
 'F9_12_PC_FED_GRNT_AUDIT_PERFORMD',
 'F9_12_PC_FED_GRNT_AUDIT_REQUIRED']

In [55]:
df[bool_cols].describe(percentiles=[]).T

Unnamed: 0,count,mean,std,min,50%,max
F9_00_HD_ADDR_CHANGE,1727056,0,0,0,0,1
F9_00_HD_AMENDED_RETURN,1727056,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_4847A1,1727056,0,0,0,0,1
F9_00_HD_EXEMPT_STATUS_501C3,1727056,1,0,0,1,1
F9_00_HD_FINAL_RETURN,1727056,0,0,0,0,1
F9_00_HD_INITIAL_RETURN,1727056,0,0,0,0,1
F9_00_HD_TYPE_ORG_ASSOCIATION,1727056,0,0,0,0,1
F9_00_HD_TYPE_ORG_CORP,1727056,1,0,0,1,1
F9_00_HD_TYPE_ORG_OTHER,1727056,0,0,0,0,1
F9_00_HD_TYPE_ORG_TRUST,1727056,0,0,0,0,1


In [57]:
for b in bool_cols:
    df[b] = df[b].astype('int')

In [58]:
df[bool_cols].dtypes

F9_00_HD_ADDR_CHANGE                 int32
F9_00_HD_AMENDED_RETURN              int32
F9_00_HD_EXEMPT_STATUS_4847A1        int32
F9_00_HD_EXEMPT_STATUS_501C3         int32
F9_00_HD_FINAL_RETURN                int32
F9_00_HD_INITIAL_RETURN              int32
F9_00_HD_TYPE_ORG_ASSOCIATION        int32
F9_00_HD_TYPE_ORG_CORP               int32
F9_00_HD_TYPE_ORG_OTHER              int32
F9_00_HD_TYPE_ORG_TRUST              int32
F9_01_PC_TERMINATION_CONTRACTION     int32
F9_06_PC_ANNUAL_DISC_COVRD_PERS      int32
F9_06_PC_CEO_COMPENSTN_PROCESS       int32
F9_06_PC_FORM_AVAIL_OWN_WEBSITE      int32
F9_06_PC_FORM_UPON_REQUEST           int32
F9_06_PC_JOINT_VENTURE_POLICY        int32
F9_06_PC_MINUTES_COMMITTEES          int32
F9_06_PC_MONITORING_OF_COI_POLICY    int32
F9_06_PC_OTHER_COMPENSTN_PROCESS     int32
F9_06_PC_OTHER_WEBSITE               int32
F9_06_PC_OWN_WEBSITE                 int32
F9_06_PC_POLICIES_GOVERN_CHAPTER     int32
F9_07_PC_NO_LISTED_PERS_COMPENSD     int32
F9_10_PC_OR

#### Save DF

In [59]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  0.5522234283333318
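
The start/stop timing pattern recurs in several cells. One way to avoid repeating it is a small context manager; the sketch below assumes a hypothetical helper named `timed` and times a toy computation rather than a real save.

```python
import timeit
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the elapsed minutes for the enclosed block (hypothetical helper)."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        elapsed = timeit.default_timer() - start
        print(label, '- # of minutes:', elapsed / 60)

# Toy workload standing in for df.to_pickle(...)
with timed('toy sum'):
    total = sum(range(1000))
```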


### Fix *problem_cols*

In [60]:
problem_cols = ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_12_PC_ACCTG_METHOD_OTHER', 'F9_00_HD_INCLUDES_SUBORD_ORGS']
exclude = problem_cols + ['F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_06_PC_STATES_WHERE_RET_FILED']

In [63]:
for col in problem_cols:
    print(df[col].value_counts().head(10), '\n\n')

{'@organization501cTypeTxt': '6', '#text': 'X'}    87293
{'@organization501cTypeTxt': '4', '#text': 'X'}    44596
{'@organization501cTypeTxt': '5', '#text': 'X'}    38180
{'@organization501cTypeTxt': '7', '#text': 'X'}    35328
{'@typeOf501cOrganization': '6', '#text': 'X'}     32959
{'@typeOf501cOrganization': '3', '#text': 'X'}     28339
{'@organization501cTypeTxt': '9', '#text': 'X'}    19818
{'@typeOf501cOrganization': '4', '#text': 'X'}     16427
{'@organization501cTypeTxt': '8', '#text': 'X'}    14776
{'@typeOf501cOrganization': '5', '#text': 'X'}     14534
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64 


{'@methodOfAccountingOtherDesc': 'MODIFIED CASH', '#text': 'X'}          10407
X                                                                         3862
{'@note': 'MODIFIED CASH', '#text': 'X'}                                  3117
{'@methodOfAccountingOtherDesc': 'Modified Cash', '#text': 'X'}           2578
{'@methodOfAccountingOtherDesc': 'MODIFIED CASH BASIS', '#text'

In [106]:
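def func(x, key1, key2):
    # Return the value stored under key1 (or, failing that, key2) of a dict cell.
    # Assumes every non-null cell is a dict; null cells and dicts with neither key yield NaN.
    if pd.isnull(x):
        return np.nan
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan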
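
To illustrate the two-key lookup, here is a sketch on a toy Series. `pick_first_key` is a hypothetical variant of `func` that additionally returns NaN for any non-dict value; the sample cells mirror the two key spellings seen in the value counts above.

```python
import numpy as np
import pandas as pd

def pick_first_key(x, key1, key2):
    # Hypothetical variant of func: anything that is not a dict becomes NaN.
    if not isinstance(x, dict):
        return np.nan
    if key1 in x:
        return x[key1]
    if key2 in x:
        return x[key2]
    return np.nan

toy = pd.Series([
    {'@organization501cTypeTxt': '6', '#text': 'X'},
    {'@typeOf501cOrganization': '7', '#text': 'X'},
    np.nan,
], dtype=object)

# Keyword arguments given to Series.apply are forwarded to the function.
parsed = toy.apply(pick_first_key,
                   key1='@organization501cTypeTxt',
                   key2='@typeOf501cOrganization')
print(parsed.tolist())
```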

#### F9_00_HD_EXEMPT_STATUS_501C

In [79]:
print(len(df[df['F9_00_HD_EXEMPT_STATUS_501C'].notnull()]))

446758


In [94]:
"""
key1 = '@organization501cTypeTxt'
key2 = '@typeOf501cOrganization'
for index, row in df[df['F9_00_HD_EXEMPT_STATUS_501C'].notnull()][:1000].iterrows():
    #print(type(row['F9_00_HD_EXEMPT_STATUS_501C']), row['F9_00_HD_EXEMPT_STATUS_501C'])
    if type(row['F9_00_HD_EXEMPT_STATUS_501C']) == dict:
        if key1 in row['F9_00_HD_EXEMPT_STATUS_501C']:
            #print(row['F9_00_HD_EXEMPT_STATUS_501C'][key1])
            df.loc[index, 'test'] = row['F9_00_HD_EXEMPT_STATUS_501C'][key1]
            
        elif key2 in row['F9_00_HD_EXEMPT_STATUS_501C']:
            #print(row['F9_00_HD_EXEMPT_STATUS_501C'][key2])
            df.loc[index, 'test'] = row['F9_00_HD_EXEMPT_STATUS_501C'][key2]
        else:
            print('SOME OTHER KEY IS THERE')
"""

In [68]:
df[df['F9_00_HD_EXEMPT_STATUS_501C'].notnull()][['F9_00_HD_EXEMPT_STATUS_501C']][:1]

Unnamed: 0,F9_00_HD_EXEMPT_STATUS_501C
96,"{'@typeOf501cOrganization': '7', '#text': 'X'}"


In [107]:
df['F9_00_HD_EXEMPT_STATUS_501C__SAFE'] = df['F9_00_HD_EXEMPT_STATUS_501C']

In [108]:
import timeit
start_time = timeit.default_timer()

for col in problem_cols[:1]:
    variable = col
    key1 = '@organization501cTypeTxt'
    key2 = '@typeOf501cOrganization'
    print(variable, key1, key2)

    # Replace each dict with the 501(c) type stored under key1 or key2
    df['F9_00_HD_EXEMPT_STATUS_501C'] = df[variable].apply(func, key1=key1, key2=key2)
    
    
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

F9_00_HD_EXEMPT_STATUS_501C @organization501cTypeTxt @typeOf501cOrganization
# of minutes:  0.0644229949999802


In [111]:
df['F9_00_HD_EXEMPT_STATUS_501C'].value_counts().head()

6    120252
4     61023
5     52714
7     48778
9     29886
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64

In [100]:
df['test'] = df['test'].astype('float')

In [110]:
df['F9_00_HD_EXEMPT_STATUS_501C'].describe().T

count     446758
unique        24
top            6
freq      120252
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: object

In [105]:
df[df['test'].notnull()][['test']][:10]

Unnamed: 0,test
96,7
97,4
98,6
99,9
100,13
101,6
102,5
103,8
104,9
105,2


In [103]:
for index, row in df[df['test'].notnull()][:10].iterrows():
    print(type(row['test']), row['test'])

<class 'float'> 7.0
<class 'float'> 4.0
<class 'float'> 6.0
<class 'float'> 9.0
<class 'float'> 13.0
<class 'float'> 6.0
<class 'float'> 5.0
<class 'float'> 8.0
<class 'float'> 9.0
<class 'float'> 2.0


#### F9_12_PC_ACCTG_METHOD_OTHER

In [112]:
df['F9_12_PC_ACCTG_METHOD_OTHER'].value_counts().head()

{'@methodOfAccountingOtherDesc': 'MODIFIED CASH', '#text': 'X'}          10407
X                                                                         3862
{'@note': 'MODIFIED CASH', '#text': 'X'}                                  3117
{'@methodOfAccountingOtherDesc': 'Modified Cash', '#text': 'X'}           2578
{'@methodOfAccountingOtherDesc': 'MODIFIED CASH BASIS', '#text': 'X'}     1469
Name: F9_12_PC_ACCTG_METHOD_OTHER, dtype: int64

In [120]:
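def func_mixed(x, key1, key2):
    # Cells are either plain strings or dicts; return the lower-cased description.
    # Plain strings are returned as-is (lower-cased); dicts are read via key1, then key2.
    if pd.isnull(x):
        return np.nan
    elif type(x)!=dict:
        return x.lower()
    elif key1 in x.keys():
        return x[key1].lower()
    elif key2 in x.keys():
        return x[key2].lower()
    else:
        return np.nan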
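
The effect of lower-casing is easiest to see on a toy Series. The sketch below uses a hypothetical restatement of `func_mixed` named `describe_method` so that it runs on its own; the sample cells are modeled on the value counts above.

```python
import numpy as np
import pandas as pd

def describe_method(x, key1, key2):
    # Hypothetical restatement of func_mixed for a self-contained example.
    if isinstance(x, dict):
        for key in (key1, key2):
            if key in x:
                return x[key].lower()
        return np.nan
    if pd.isnull(x):
        return np.nan
    return x.lower()

toy = pd.Series([
    {'@methodOfAccountingOtherDesc': 'MODIFIED CASH', '#text': 'X'},
    {'@note': 'Modified Cash', '#text': 'X'},
    'X',
    np.nan,
], dtype=object)

desc = toy.apply(describe_method,
                 key1='@methodOfAccountingOtherDesc', key2='@note')
# Both spellings of "modified cash" now fall into one category.
print(desc.value_counts())
```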

In [131]:
df['F9_12_PC_ACCTG_METHOD_OTHER__SAFE'] = df['F9_12_PC_ACCTG_METHOD_OTHER']

In [132]:
import timeit
start_time = timeit.default_timer()

for col in problem_cols[1:2]:
    variable = col
    key1 = '@methodOfAccountingOtherDesc'
    key2 = '@note'
    print(variable, key1, key2)

    # Store the lower-cased free-text description in a new column
    df['F9_12_PC_ACCTG_METHOD_OTHER__description'] = df[variable].apply(func_mixed, key1=key1, key2=key2)
    
    
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

F9_12_PC_ACCTG_METHOD_OTHER @methodOfAccountingOtherDesc @note
# of minutes:  0.029805014999995667


In [133]:
df['F9_12_PC_ACCTG_METHOD_OTHER'] = np.where(df['F9_12_PC_ACCTG_METHOD_OTHER'].notnull(), 1, 0)
df['F9_12_PC_ACCTG_METHOD_OTHER'].value_counts()

0    1691636
1      35420
Name: F9_12_PC_ACCTG_METHOD_OTHER, dtype: int64

In [134]:
df['F9_12_PC_ACCTG_METHOD_OTHER__description'].value_counts().head(50)

modified cash          18536
x                       3862
modified cash basis     3146
hybrid                  1327
modified accrual        1117
modified cas             851
mod cash                 578
mod. cash                450
mod accrual              433
modified                 356
modified accrua          303
regulatory               244
income tax               242
statutory                224
tax basis                177
see sch o                167
income tax basis         141
mod. accrual             108
hybred                   106
modified cash b           97
modified acc              96
mod cash basis            88
modifiedaccrual           77
ocboa                     72
fund accounting           66
see schedule o            64
modifed cash              63
modifiedcash              60
cashaccrual               59
modified accr             54
modified accrl            50
accrual                   49
modified-cash             48
modified ca               45
cash          

In [137]:
#df = df.drop('test', 1)
#df = df.drop('F9_00_HD_EXEMPT_STATUS_501C__SAFE', 1)
#df = df.drop('F9_12_PC_ACCTG_METHOD_OTHER__SAFE', 1)

#### F9_00_HD_INCLUDES_SUBORD_ORGS

In [143]:
df['F9_00_HD_INCLUDES_SUBORD_ORGS'].value_counts().head(10)

false                                                             265079
1                                                                   6696
true                                                                4816
{'@referenceDocumentId': '', '#text': 'false'}                       462
{'@referenceDocumentId': 'RetDoc2030100001', '#text': '0'}           394
0                                                                    139
{'@referenceDocumentId': 'RetDoc1', '#text': 'false'}                132
{'@referenceDocumentId': 'AffiliateListing', '#text': 'false'}        82
{'@referenceDocumentId': 'AffiliateListing', '#text': 'true'}         57
{'@referenceDocumentId': 'STM128', '#text': 'false'}                  38
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64

In [146]:
def func_onekey_mixed(x, key):
    if pd.isnull(x):
        return np.nan
    elif type(x)!=dict:
        return x.lower()
    elif key in x.keys():
        return x[key].lower()
    else:
        return np.nan
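
The dictionaries in this column store their value under `'#text'` (see the value counts above), so the key passed to `func_onekey_mixed` must include the `#`. A small plain-Python check of that, using a hypothetical helper `lookup` and one sample cell:

```python
sample = {'@referenceDocumentId': 'RetDoc1', '#text': 'false'}

def lookup(d, key):
    # Return the lower-cased value stored under key, or None if the key is absent.
    return d[key].lower() if key in d else None

print(lookup(sample, 'text'))    # key without '#': not found
print(lookup(sample, '#text'))   # key as it appears in the data
```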

In [149]:
df['F9_00_HD_INCLUDES_SUBORD_ORGS__SAFE'] = df['F9_00_HD_INCLUDES_SUBORD_ORGS']

In [164]:
import timeit
start_time = timeit.default_timer()

for col in problem_cols[2:3]:
    variable = col
    # The dict cells in this column store their value under '#text', so 'text'
    # matches none of them: dict cells become NaN and only plain strings are kept.
    key = 'text'
    print(variable, key)

    df['F9_00_HD_INCLUDES_SUBORD_ORGS'] = df[variable].apply(func_onekey_mixed, key=key)
    
    
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

F9_00_HD_INCLUDES_SUBORD_ORGS text
# of minutes:  0.0291157016666754


In [165]:
df['F9_00_HD_INCLUDES_SUBORD_ORGS'].value_counts()

false    265079
1          6696
true       4816
0           139
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64

In [158]:
def binarize(df, variable):
    print(df[variable].value_counts(), '\n')
    df[variable] = np.where(df[variable]=='true', 1, df[variable])
    df[variable] = np.where(df[variable]=='false', 0, df[variable])
    df[variable] = np.where(df[variable]=='1', 1, df[variable])
    df[variable] = np.where(df[variable]=='0', 0, df[variable])
    df[variable] = np.where(df[variable]=='X', 1, df[variable])
    print(df[variable].value_counts(), '\n\n')
    return df.sample(5)[['EIN', variable]]
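
The chain of `np.where` calls can also be written as a single dictionary lookup. This is a sketch of that alternative, not the method used in this notebook; `FLAG_MAP` is a hypothetical name. Note one difference: `Series.map` turns values that are not in the mapping into NaN, whereas `binarize` leaves them unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping covering the string codes handled by binarize
# (lower-case 'x' because the values were lower-cased earlier).
FLAG_MAP = {'true': 1, 'false': 0, '1': 1, '0': 0, 'x': 1}

toy = pd.Series(['false', '1', 'true', '0', np.nan], dtype=object)

# map() looks each value up in FLAG_MAP; missing or unmapped values become NaN.
flags = toy.map(FLAG_MAP)
print(flags.tolist())
```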

In [166]:
binarize(df, 'F9_00_HD_INCLUDES_SUBORD_ORGS')

false    265079
1          6696
true       4816
0           139
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64 

0    265218
1     11512
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64 




Unnamed: 0,EIN,F9_00_HD_INCLUDES_SUBORD_ORGS
8337,221010215,
344,431708529,
2784,850138775,
2843,340743324,
4492,941126450,


In [167]:
df['F9_00_HD_INCLUDES_SUBORD_ORGS'] = np.where(df['F9_00_HD_INCLUDES_SUBORD_ORGS']==1, 1, 0)
df['F9_00_HD_INCLUDES_SUBORD_ORGS'].value_counts()

0    1715544
1      11512
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64

In [176]:
#df = df.drop('test', 1)
#df = df.drop('F9_00_HD_INCLUDES_SUBORD_ORGS__SAFE', 1)

#### F9_00_HD_SPECIAL_CONDITION_DESC

In [185]:
df['F9_00_HD_SPECIAL_CONDITION_DESC__SAFE'] = df['F9_00_HD_SPECIAL_CONDITION_DESC']

In [186]:
for index, row in df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][:5].iterrows():
    print(type(row['F9_00_HD_SPECIAL_CONDITION_DESC']), row['F9_00_HD_SPECIAL_CONDITION_DESC'])

<class 'str'> EXTENSION GRANTED TO 11152011
<class 'str'> EXTENSION GRANTED TO 21511
<class 'str'> EXTENDED TO FEBRUARY 15 2011
<class 'list'> ['WITH AN EXPLANATORY STATEMENT.', 'THE TAXPAYER FILES AND REPORTS ITS ACTIVITIES ON', 'THE FEDERAL EXTENSIONS FILED ON A CALENDAR YEAR', 'PROTECTIVE FEDERAL EXTENSIONS WERE FILED ON A', 'FISCAL YEAR 6-30-2010 EXTENSIONS ARE INCLUDED', 'E-FILING: ACCOUNTING PERIODS & FEDERAL EXTENSIONS', 'DID NOT AUTHORIZE A CHANGE IN ACCOUNTING PERIOD.', 'CALENDAR YEAR BASIS BUT THE BOARD OF DIRECTORS', 'BASIS ARE REVOKED AND RESCINDED AND THE', 'A JUNE 30 FISCAL YEAR BASIS.']
<class 'list'> ['YEAR JUNE 30, PERIOD. THE TAXPAYER FILED VALID', 'TIMELY EXTENSIONS ON A FISCAL YEAR BASIS.', 'THE TAXPAYER REPORTS ITS ACTIVTIES ON A FISCAL', 'THE TAXPAYER HAD FILED PROTECTIVE EXTENSIONS', 'THE BOARD DID NOT AUTHORIZE THE CHANGE IN', 'TAX RETURN.', 'SEE ATTACHED MEMORANDUM INCLUDED WITH THE', 'RULES REQUIRED UNDER GAAP AND GAAS.', 'ON A CALENDAR YEAR BASIS BECAUSE THE 

In [205]:
def func_join_list(x, tipo):
    # Test the container type first: pd.isnull() on a list returns an array,
    # which raises a ValueError when used as an `if` condition.
    if type(x)==tipo:
        return ' '.join(x)
    elif pd.isnull(x):
        return np.nan
    else:
        return x

In [219]:
def func_join_list(x):
    # Test for a list first: pd.isnull() on a list returns an array,
    # which raises a ValueError when used as an `if` condition.
    if type(x)==list:
        return ' '.join(x)
    elif pd.isnull(x):
        return np.nan
    else:
        return x

In [225]:
df['F9_00_HD_SPECIAL_CONDITION_DESC'] = df['F9_00_HD_SPECIAL_CONDITION_DESC'].apply(lambda x: ' '.join(x) if type(x)==list else x)
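
A minimal sketch of the same list-joining step on a toy Series, showing that strings and missing values pass through untouched. The sample values are assumptions for illustration; `isinstance` is used here in place of `type(x)==list`.

```python
import numpy as np
import pandas as pd

toy = pd.Series([['PA', 'NJ', 'DE'], 'CA', np.nan], dtype=object)

# Only list cells are joined; every other cell is returned unchanged.
joined = toy.apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
print(joined.tolist())
```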

In [226]:
df['F9_00_HD_SPECIAL_CONDITION_DESC'].value_counts().head()

PUBLIC DISCLOSURE COPY         26
HURRICANE IRMA                 18
HURRICANE SANDY                12
EXTENDED TO 8172015            11
EXTENSION GRANTED TO 111519    11
Name: F9_00_HD_SPECIAL_CONDITION_DESC, dtype: int64

In [227]:
for index, row in df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][:5].iterrows():
    print(type(row['F9_00_HD_SPECIAL_CONDITION_DESC']), row['F9_00_HD_SPECIAL_CONDITION_DESC'])

<class 'str'> EXTENSION GRANTED TO 11152011
<class 'str'> EXTENSION GRANTED TO 21511
<class 'str'> EXTENDED TO FEBRUARY 15 2011
<class 'str'> WITH AN EXPLANATORY STATEMENT. THE TAXPAYER FILES AND REPORTS ITS ACTIVITIES ON THE FEDERAL EXTENSIONS FILED ON A CALENDAR YEAR PROTECTIVE FEDERAL EXTENSIONS WERE FILED ON A FISCAL YEAR 6-30-2010 EXTENSIONS ARE INCLUDED E-FILING: ACCOUNTING PERIODS & FEDERAL EXTENSIONS DID NOT AUTHORIZE A CHANGE IN ACCOUNTING PERIOD. CALENDAR YEAR BASIS BUT THE BOARD OF DIRECTORS BASIS ARE REVOKED AND RESCINDED AND THE A JUNE 30 FISCAL YEAR BASIS.
<class 'str'> YEAR JUNE 30, PERIOD. THE TAXPAYER FILED VALID TIMELY EXTENSIONS ON A FISCAL YEAR BASIS. THE TAXPAYER REPORTS ITS ACTIVTIES ON A FISCAL THE TAXPAYER HAD FILED PROTECTIVE EXTENSIONS THE BOARD DID NOT AUTHORIZE THE CHANGE IN TAX RETURN. SEE ATTACHED MEMORANDUM INCLUDED WITH THE RULES REQUIRED UNDER GAAP AND GAAS. ON A CALENDAR YEAR BASIS BECAUSE THE BOARD OF ITS CALEDNAR YEAR EXTENSIONS. EXTENSIONS AND HAS A

In [235]:
#df = df.drop('test', 1)
df = df.drop(columns='F9_00_HD_SPECIAL_CONDITION_DESC__SAFE')

#### F9_06_PC_STATES_WHERE_RET_FILED

In [229]:
df['F9_06_PC_STATES_WHERE_RET_FILED__SAFE'] = df['F9_06_PC_STATES_WHERE_RET_FILED']

In [230]:
df['F9_06_PC_STATES_WHERE_RET_FILED'] = df['F9_06_PC_STATES_WHERE_RET_FILED'].apply(lambda x: ' '.join(x) if type(x)==list else x)

In [231]:
df['F9_06_PC_STATES_WHERE_RET_FILED'].value_counts().head()

CA    119601
NY     91240
PA     46096
MA     44149
OH     40398
Name: F9_06_PC_STATES_WHERE_RET_FILED, dtype: int64

In [237]:
df[['F9_06_PC_STATES_WHERE_RET_FILED__SAFE', 'F9_06_PC_STATES_WHERE_RET_FILED']][:1]

Unnamed: 0,F9_06_PC_STATES_WHERE_RET_FILED__SAFE,F9_06_PC_STATES_WHERE_RET_FILED
0,"[PA, NJ, DE]",PA NJ DE


In [238]:
df = df.drop(columns='F9_06_PC_STATES_WHERE_RET_FILED__SAFE')

#### Save DF

In [240]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  0.5855652249999973


In [5]:
%%time
df = pd.read_pickle('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).pkl')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

# of columns: 191
# of observations: 1727056
Wall time: 43.6 s


Unnamed: 0,DLN,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,OrganizationName,TaxPeriod,URL,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,F9_10_PC_UNSECURED_NOTES_BOY,F9_12_PC_ACCTG_METHOD_OTHER__description
0,93493313013011,232705170,0.0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,201012,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,2010-12-31,2010,1,0,,0,,1,0,,1473903.0,0,0,0,MICHAEL ANTON,,PA,0,1,0,,0,,1992.0,0.0,1439340.0,1044925.0,638637.0,10,30447.0,1753405.0,243131.0,0.0,0,0.0,0.0,89152.0,193604.0,0,2440859.0,881768.0,195892,0,0.0,450430.0,1075372.0,0,0.0,10,0.0,925000.0,33563.0,1990429.0,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751.0,1000,0.0,0.0,0.0,1925215.0,1384751.0,171810.0,1473903.0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,1,10,10,0,0,0,0,0,PA NJ DE,0,0,0,1,0.0,0.0,0,0.0,0.0,0.0,1439340.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1439340.0,1000.0,0.0,1473903.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21675.0,0.0,215.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1384751.0,195892.0,145115.0,1043744.0,0.0,0.0,0.0,256845.0,86228.0,0.0,1,0,240077.0,0.0,332660.0,270700.0,0.0,0.0,2440859.0,0.0,0.0,0.0,89152.0,0.0,0,1,0,0,1,0,0,1,0.0,
1,93493313013111,581805618,0.0,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,201106,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,2011-06-30,2010,0,0,,0,,1,0,1736.0,266420.0,0,0,0,,,WY,0,1,0,,0,,1993.0,0.0,0.0,0.0,0.0,13,1425.0,1437850.0,189785.0,0.0,0,0.0,222839.0,-39085.0,-36926.0,0,1433342.0,261190.0,0,0,0.0,34577.0,224264.0,0,0.0,19,0.0,0.0,828.0,1398765.0,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,222550.0,0,265592.0,82955.0,71405.0,1455332.0,305505.0,17482.0,266420.0,1,1,1,0,1,1,1,1,0,1,1,0,1,0,0,0,0,0,1,1,1,13,19,0,1,0,0,0,,0,0,1,0,0.0,0.0,1,411648.0,0.0,1180355.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,265592.0,0.0,0.0,0.0,265592.0,266420.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7500.0,0.0,0.0,0.0,21600.0,0.0,0.0,0.0,17714.0,17714.0,0.0,0.0,59440.0,59440.0,0.0,0.0,5801.0,5801.0,0.0,0.0,0.0,0.0,305505.0,0.0,29100.0,276405.0,0.0,250.0,22261.0,2187206.0,904332.0,0.0,1,0,11349.0,0.0,0.0,0.0,7035.0,0.0,1433342.0,0.0,0.0,0.0,-39085.0,0.0,0,1,0,0,1,1,1,1,0.0,


In [241]:
import timeit
start_time = timeit.default_timer()
df.to_csv('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull)).csv', index=False)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  6.210107166666663


In [247]:
len('F9_12_PC_ACCTG_METHOD_OTHER__description')

40

In [242]:
import timeit
start_time = timeit.default_timer()
#df[[c for c in df.columns.tolist() if c not in exclude]].to_stata('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).dta', version=117)
df.to_stata('all filings - with 185 newly named control variables (with parsed sub-key variables and fillnull).dta', version=117)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

C:\Users\Gregory\Anaconda3\lib\site-packages\pandas\io\stata.py:2136: InvalidColumnName: 
Not all pandas column names were valid Stata variable names.
The following replacements have been made:

    b'F9_01_PC_PROF_FUNDRISING_EXP_CURR'   ->   F9_01_PC_PROF_FUNDRISING_EXP_CUR
    b'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR'   ->   F9_01_PC_PROF_FUNDRISING_EXP_PRI
    b'F9_06_PC_MONITORING_OF_COI_POLICY'   ->   F9_06_PC_MONITORING_OF_COI_POLIC
    b'F9_12_PC_ACCTG_METHOD_OTHER__description'   ->   F9_12_PC_ACCTG_METHOD_OTHER__des

If this is not what you expect, please make sure you have Stata-compliant
column names in your DataFrame (strings only, max 32 characters, only
alphanumerics and underscores, no Stata reserved words)



# of minutes:  3.103392418333351
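
To control the Stata variable names rather than rely on the automatic replacements reported in the warning, the columns could be renamed before calling `to_stata`. The following is a sketch under assumptions: `stata_safe_names` is a hypothetical helper that truncates to 32 characters and adds a numeric suffix on collisions. It makes no claim about how pandas itself chooses replacement names, although for the names listed above simple truncation gives the same result.

```python
def stata_safe_names(columns, max_len=32):
    """Map each column name to a unique name of at most max_len characters."""
    mapping, used = {}, set()
    for col in columns:
        name = col[:max_len]
        suffix = 1
        # If truncation collides with an earlier name, replace the tail with a counter.
        while name in used:
            tail = '_' + str(suffix)
            name = col[:max_len - len(tail)] + tail
            suffix += 1
        used.add(name)
        mapping[col] = name
    return mapping

cols = ['F9_01_PC_PROF_FUNDRISING_EXP_CURR',
        'F9_12_PC_ACCTG_METHOD_OTHER__description',
        'EIN']
mapping = stata_safe_names(cols)
for old, new in mapping.items():
    print(old, '->', new)
```

The resulting dictionary could then be passed to `df.rename(columns=mapping)` before the `to_stata` call.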
