# 12/8/2020

Main purpose of notebook:
- In this notebook I change the data type of the relevant variables to *int* or *float*, using a new approach for changing data types

Steps:
- Read in *concordance_VERIFIED.xlsx* in order to access the *python_data_type* column
    - Collapse to *new_variables_df* then use that DF

- Read in DF: 
    - *all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl*
 
- Fix missing values for one row for *TaxPeriod* and *fiscal_year*

- I loop over variables in each *data_type_xsd* (e.g., DateType, StateType, etc.) and inspect the *dtypes* and sample values for these variables
    - I then apply a value of *python_data_type* into *new_variables_df* for those variables:
        - either 'string', 'DateTime', or 'Int64'
    - In the next run this section can be used solely for verification -- the updated *concordance* file will contain the data type in *python_data_type*

- SIDEBAR -- Find out how many are in the e-file data

- Fill EIN for one obs taking from *Filer* column

- Change data types for *DateTime* and *Int64* variables:
    - string variables -- nothing done
    - DateTime variables -- change both
    - Int64 variables -- change all with a one-liner:
        - df[Int64_vars] = df[Int64_vars].apply(pd.to_numeric)
        - This one-liner averts issues with converting to 'Int64' -- *pd.to_numeric( )* returns either 'int64' or 'float64' for each column ('float64' when the column contains missing or non-integer values)

- Save DF:
    - *all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types).pkl*
	
- Merge *python_data_type* into *concordance* and save *concordance_VERIFIED.xlsx*

- Look at 501(c)(3)s based off the variable *501c3* and save a list: *ein_list_501c3.json* (N=1,435,470)
    - This section is not needed

Note:
- Even though I use 'Int64' as the value for all integer columns in *python_data_type*, the *pd.to_numeric( )* code chooses whether to convert each column to 'int64' or 'float64'
- Three variables -- *fiscal_year*,  *TaxPeriod*, and *F9_00_HD_TAX_PER_END* -- are all based off the same date, which is the *END* of the tax period, while *F9_00_HD_TAX_YEAR* reflects the year in which the tax period *BEGINS*. See *IRS 990 e-File Data -- CONTROL VARIABLES (A2) -- Combine Columns (Python 3.6).ipynb*
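
To illustrate the note about *pd.to_numeric( )*, here is a minimal sketch on toy data. The column names *complete* and *with_gaps* are hypothetical and not part of the 990 data; the behavior shown is what I would expect from pandas in general, not a check run against the real DF.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; column names are hypothetical.
toy = pd.DataFrame({
    'complete': ['2015', '2016', '2017'],     # digit strings, no missing values
    'with_gaps': ['1997', np.nan, '1988.0'],  # digit strings plus one missing value
}, dtype=object)

int_vars = ['complete', 'with_gaps']

# Same pattern as the one-liner in the steps list, kept in a new object
# so that the original strings stay available for comparison.
converted = toy[int_vars].apply(pd.to_numeric)

print(converted.dtypes)
# complete       int64
# with_gaps    float64
```

A column without missing values comes back as plain *int64*, while a column with at least one NaN falls back to *float64*, since NumPy integers cannot hold NaN. Neither result is the nullable 'Int64' type; that would need an explicit *.astype('Int64')* afterwards.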

# Load Packages and Connect to MongoDB

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

1.0.1


In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [4]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read in Concordance File
We are going to read in two codebooks. First, there is the 'concordance' file. Specifically, before re-arranging and renaming variables, we will read in the relevant section from the *master concordance* file, and then use this file to identify the relevant 'compensation' variables. In a following notebook, we will be using the *new_variable_name* field as our variable name.

In [5]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 16
# of observations: 384


Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,,TaxPeriodEndDate,,
1,/Return/ReturnHeader/TaxPeriodEndDt,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,,TaxPeriodEndDt,,


In [6]:
concordance[concordance['sub_key'].notnull()][['variable_name_new', 'MongoDB_Name', 'sub_key']]

Unnamed: 0,variable_name_new,MongoDB_Name,sub_key
5,F9_00_HD_SIGNING_OFFICER_SIGNTR,BusinessOfficerGrp,SignatureDt
6,F9_00_HD_SIGNING_OFFICER_SIGNTR,Officer,DateSigned
9,F9_00_HD_FILER_STATE_US,Filer,USAddress
10,F9_00_HD_FILER_STATE_US,Filer,USAddress
242,F9_09_PC_COMP_OFFICERS_TOTAL,CompCurrentOfcrDirectorsGrp,TotalAmt
...,...,...,...
373,F9_10_PC_CASH_NON_INTEREST_EOY,CashNonInterestBearingGrp,EOYAmt
374,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,SavingsAndTempCashInvestments,BOY
375,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,SavingsAndTempCashInvstGrp,BOYAmt
376,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,SavingsAndTempCashInvestments,EOY


# Read 990 DB into PANDAS DF
We can modify the above code block to read all filings into a PANDAS dataframe.

In [13]:
%%time
df = pd.read_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables).pkl')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

# of columns: 200
# of observations: 1895016
Wall time: 2min 2s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1,,,,,1,,,1473903,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0,1439340,1044925,638637,10,30447,1753405,243131,0,0,0,0,89152,193604,,2440859,881768,195892,0,0,450430,1075372,0,0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1,0,0,0,0,0,0,1439340,,,,,,,,,,,,,,,1439340,1000,,1473903,,,,,,,,,21675,,215,,,,,,,,,,,,,,,,,,,,1384751,195892,145115,1043744,,,,256845,86228,,1,,240077,,332660,270700,,,,2440859,,,,89152,,0,1,,,1,,0,1,1,PA


In [14]:
df[['F9_00_HD_SIGNING_OFFICER_SIGNTR']].sample(5)

Unnamed: 0,F9_00_HD_SIGNING_OFFICER_SIGNTR
727052,2015-04-15
478463,2014-02-27
909575,2016-04-27
1400041,2018-07-26
455414,2014-02-12


# Combine variables for alternative concordance file
Next run, I can delete the 'sub_keys' column from here -- I don't use it in this notebook

In [15]:
def agg_funcs(x):
    names = {
        #'name': x['variable_name_new'].head(1).values[0],
        'original_names':  list(set(x['MongoDB_Name'].tolist())),
        'sub_keys':  list(set(x['sub_key'].tolist())),
        'data_type_xsd': x['data_type_xsd'].head(1).values[0]
    }
    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS
    #return pd.Series(names, index = list(names.keys()))
    return pd.Series(names, index=['original_names', 'sub_keys', 'data_type_xsd'])
new_variables_df = concordance.groupby(['variable_name_new']).apply(agg_funcs)
new_variables_df = new_variables_df.reset_index()
print('# of variables:', len(new_variables_df))
new_variables_df[:]

# of variables: 193


Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",[nan],CheckboxType
1,F9_00_HD_AMENDED_RETURN,"[AmendedReturn, AmendedReturnInd]",[nan],CheckboxType
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],[nan],TimestampType
3,F9_00_HD_CTRY_OF_DOMICILE,"[LegalDomicileCountryCd, CountryLegalDomicile]",[nan],CountryType
4,F9_00_HD_EXEMPT_STATUS_4847A1,"[Organization4947a1NotPFInd, Organization4947a1]",[nan],CheckboxType
...,...,...,...,...
188,F9_12_PC_AUDIT_COMMITTEE,"[AuditCommittee, AuditCommitteeInd]",[nan],BooleanType
189,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,"[FederalGrantAuditPerformed, FederalGrantAuditPerformedInd]",[nan],BooleanType
190,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,"[FederalGrantAuditRequiredInd, FederalGrantAuditRequired]",[nan],BooleanType
191,F9_12_PC_FINCL_STMTS_AUDITED,"[FSAuditedInd, FSAudited]",[nan],BooleanType
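
As a possible alternative to the *agg_funcs* helper, the same collapse can be sketched with pandas named aggregation. This is only an illustration on a toy concordance: the rows and names (*VAR_A*, *OldNameA*, etc.) are made up, and the sketch sorts the collected names and drops missing sub-keys, which the original cell does not do.

```python
import pandas as pd

# Toy concordance with hypothetical rows; the real file has 384 rows.
toy_concordance = pd.DataFrame({
    'variable_name_new': ['VAR_A', 'VAR_A', 'VAR_B'],
    'MongoDB_Name': ['OldNameA', 'OldNameAInd', 'OldNameB'],
    'sub_key': [None, None, 'TotalAmt'],
    'data_type_xsd': ['CheckboxType', 'CheckboxType', 'USAmountType'],
})

collapsed = (
    toy_concordance
    .groupby('variable_name_new')
    .agg(
        # unique original names per new variable, sorted for a stable order
        original_names=('MongoDB_Name', lambda s: sorted(set(s))),
        # unique non-missing sub-keys (empty list when there are none)
        sub_keys=('sub_key', lambda s: sorted(set(s.dropna()))),
        # first non-missing xsd type within the group
        data_type_xsd=('data_type_xsd', 'first'),
    )
    .reset_index()
)

print(collapsed)
```

Named aggregation keeps the output columns in the order they are listed, so the explicit *index=[...]* in the *pd.Series* call is not needed. One difference to keep in mind: *'first'* skips missing values, whereas *head(1)* in the original takes the first row as-is.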


In [16]:
new_variables_df['len'] = new_variables_df['original_names'].apply(lambda x: len(x))
print(new_variables_df['len'].value_counts(), '\n')
new_variables_df

2    189
1      4
Name: len, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",[nan],CheckboxType,2
1,F9_00_HD_AMENDED_RETURN,"[AmendedReturn, AmendedReturnInd]",[nan],CheckboxType,2
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],[nan],TimestampType,1
3,F9_00_HD_CTRY_OF_DOMICILE,"[LegalDomicileCountryCd, CountryLegalDomicile]",[nan],CountryType,2
4,F9_00_HD_EXEMPT_STATUS_4847A1,"[Organization4947a1NotPFInd, Organization4947a1]",[nan],CheckboxType,2
...,...,...,...,...,...
188,F9_12_PC_AUDIT_COMMITTEE,"[AuditCommittee, AuditCommitteeInd]",[nan],BooleanType,2
189,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,"[FederalGrantAuditPerformed, FederalGrantAuditPerformedInd]",[nan],BooleanType,2
190,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,"[FederalGrantAuditRequiredInd, FederalGrantAuditRequired]",[nan],BooleanType,2
191,F9_12_PC_FINCL_STMTS_AUDITED,"[FSAuditedInd, FSAudited]",[nan],BooleanType,2


In [17]:
new_variables_df[new_variables_df['len']==1]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],[nan],TimestampType,1
7,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,1
137,F9_09_PC_FEES_FOR_SVCE_FR_TOT,[FeesForServicesProfFundraising],"[Total, TotalAmt]",USAmountType,1
192,TaxPeriod,[TaxPeriod],[nan],YearMonthType,1


### Look at different data types in turn

In [18]:
new_variables_df['data_type_xsd'].value_counts()

USAmountType            78
BooleanType             36
USAmountNNType          32
CheckboxType            22
IntegerNNType            4
CountType                4
StateType                3
DateType                 2
YearType                 2
TimestampType            2
LineExplanationType      2
ShortExplanationType     1
CountryType              1
TextType                 1
StringType               1
PersonNameType           1
YearMonthType            1
Name: data_type_xsd, dtype: int64

In [26]:
type_list = list(set(new_variables_df['data_type_xsd'].tolist()))
print(len(type_list))
print(type_list)
type_list[16]

17
['DateType', 'StateType', 'PersonNameType', 'YearType', 'BooleanType', 'CountType', 'USAmountType', 'YearMonthType', 'ShortExplanationType', 'TimestampType', 'CountryType', 'TextType', 'LineExplanationType', 'USAmountNNType', 'StringType', 'CheckboxType', 'IntegerNNType']


'IntegerNNType'

In [21]:
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']=='YearMonthType']['variable_name_new'].tolist()]].sample(5)

Unnamed: 0,TaxPeriod
1345055,201712
1570331,201806
1764005,201812
1449880,201712
396661,201209


#### Look for non-concordance variables
All of these can stay the same data type as they currently are

In [50]:
set(new_variables_df['variable_name_new'].tolist()) - set(df.columns.tolist())

set()

In [48]:
set(df.columns.tolist()) - set(new_variables_df['variable_name_new'].tolist())

{'501c3', 'DLN', 'EIN', 'Filer', 'OrganizationName', 'URL', 'fiscal_year'}

In [53]:
non_concordance_vars = ['501c3', 'DLN', 'EIN', 'Filer', 'OrganizationName', 'URL', 'fiscal_year']

In [54]:
df[non_concordance_vars].dtypes

501c3                int32
DLN                 object
EIN                 object
Filer               object
OrganizationName    object
URL                 object
fiscal_year         object
dtype: object

In [85]:
df[df['fiscal_year'].isnull()][[ 'F9_00_HD_TAX_YEAR', 'fiscal_year', 'TaxPeriod', 'F9_00_HD_TAX_PER_END', '501c3']]

Unnamed: 0,F9_00_HD_TAX_YEAR,fiscal_year,TaxPeriod,F9_00_HD_TAX_PER_END,501c3
1895015,2018,,,2018-12-31,1


So, three variables -- *fiscal_year*,  *TaxPeriod*, and *F9_00_HD_TAX_PER_END* -- are all based off the same date, which is the *END* of the tax period, while *F9_00_HD_TAX_YEAR* reflects the year in which the tax period *BEGINS*. See *IRS 990 e-File Data -- CONTROL VARIABLES (A2) -- Combine Columns (Python 3.6).ipynb*
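
Because all three variables come from the same end-of-period date, a missing *fiscal_year* or *TaxPeriod* can in principle be rebuilt from *F9_00_HD_TAX_PER_END*. The sketch below shows the idea on a two-row toy frame; it assumes the end date is always a 'YYYY-MM-DD' string, which matches the samples shown here but is not verified for the full DF.

```python
import pandas as pd

# Toy frame mimicking the one incomplete row plus one complete row.
toy = pd.DataFrame({
    'F9_00_HD_TAX_PER_END': ['2018-12-31', '2018-06-30'],
    'fiscal_year': [None, '2018'],
    'TaxPeriod': [None, '201806'],
}, dtype=object)

end_date = toy['F9_00_HD_TAX_PER_END']
derived_year = end_date.str[:4]                          # 'YYYY'
derived_period = end_date.str[:4] + end_date.str[5:7]    # 'YYYYMM'

# Fill only the missing cells; existing values are left untouched.
toy['fiscal_year'] = toy['fiscal_year'].fillna(derived_year)
toy['TaxPeriod'] = toy['TaxPeriod'].fillna(derived_period)

print(toy)
```

For the single affected row this gives the same values that are assigned by hand in this notebook ('2018' and '201812').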

In [86]:
df.loc[[1895015]][[ 'F9_00_HD_TAX_YEAR', 'fiscal_year', 'TaxPeriod', 'F9_00_HD_TAX_PER_END', '501c3']]

Unnamed: 0,F9_00_HD_TAX_YEAR,fiscal_year,TaxPeriod,F9_00_HD_TAX_PER_END,501c3
1895015,2018,,,2018-12-31,1


In [87]:
df[[ 'F9_00_HD_TAX_YEAR', 'fiscal_year', 'TaxPeriod', 'F9_00_HD_TAX_PER_END', '501c3']].sample(5)

Unnamed: 0,F9_00_HD_TAX_YEAR,fiscal_year,TaxPeriod,F9_00_HD_TAX_PER_END,501c3
1210345,2016,2016,201612,2016-12-31,1
1639594,2017,2018,201806,2018-06-30,1
623643,2013,2014,201403,2014-03-31,1
1859651,2019,2019,201912,2019-12-31,0
435624,2012,2013,201306,2013-06-30,0


In [95]:
for index, row in df[:3].iterrows():
    print(type(row['TaxPeriod']), row['TaxPeriod'])

<class 'str'> 201012
<class 'str'> 201106
<class 'str'> 201106


In [94]:
for index, row in df[:3].iterrows():
    print(type(row['fiscal_year']), row['fiscal_year'])

<class 'str'> 2010
<class 'str'> 2011
<class 'str'> 2011


##### Fix the two values

In [93]:
df.loc[1895015, 'fiscal_year'] = '2018'

In [96]:
df.loc[1895015, 'TaxPeriod'] = '201812'

In [190]:
df.loc[[1895015]][[ 'F9_00_HD_TAX_YEAR', 'fiscal_year', 'TaxPeriod', 'F9_00_HD_TAX_PER_END', '501c3']]

Unnamed: 0,F9_00_HD_TAX_YEAR,fiscal_year,TaxPeriod,F9_00_HD_TAX_PER_END,501c3
1895015,2018,2018,201812,2018-12-31,1


In [99]:
df[non_concordance_vars].sample(2)

Unnamed: 0,501c3,DLN,EIN,Filer,OrganizationName,URL,fiscal_year
818870,1,93493355002205,522086928,"{'EIN': '522086928', 'BusinessName': {'BusinessNameLine1Txt': 'Allegany Law Foundation Inc'}, 'BusinessNameControlTxt': 'ALLE', 'USAddress': {'AddressLine1Txt': '110 GREENE ST', 'CityNm': 'CUMBERLAND', 'StateAbbreviationCd': 'MD', 'ZIPCd': '21502'}}",ALLEGANY LAW FOUNDATION INC,https://s3.amazonaws.com/irs-form-990/201503559349300220_public.xml,2015
117645,1,93493314018122,421636592,"{'EIN': '421636592', 'Name': {'BusinessNameLine1': 'Education For Just Peace in Middle East'}, 'NameControl': 'EDUC', 'Phone': '2023320994', 'USAddress': {'AddressLine1': '1736 Columbia Road NW', 'City': 'Washington', 'State': 'DC', 'ZIPCode': '2...",EDUCATION FOR JUST PEACE IN MIDDLE EAST,https://s3.amazonaws.com/irs-form-990/201223149349301812_public.xml,2011


# Look at each *data_type_xsd* in turn -- next run, this section can be deleted and just the concordance file's *python_data_type* column can be used 

- So, I can skip to the level-one section called 'Change Data Types'
- Alternatively, go over this section but do not overwrite the *python_data_type* column in *concordance* unless issues found.
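
The cells in this section repeat the same column-selection expression and index *type_list* by position. Since *type_list* is built from a *set*, its order is not guaranteed to be the same on the next run. A sketch of a name-based alternative follows; the helper *columns_of_type* and the dictionary *XSD_TO_PYTHON* are hypothetical, and the mapping only covers the types assigned in the cells shown in this section.

```python
import pandas as pd

# Hypothetical, partial mapping taken from the assignments made in this section.
XSD_TO_PYTHON = {
    'DateType': 'string',
    'StateType': 'string',
    'PersonNameType': 'string',
    'YearType': 'Int64',
    'BooleanType': 'Int64',
    'CountType': 'Int64',
}

# Toy version of new_variables_df with three of its variables.
toy_vars = pd.DataFrame({
    'variable_name_new': ['F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR',
                          'F9_01_PC_NET_ASSETS_BOY'],
    'data_type_xsd': ['DateType', 'YearType', 'USAmountType'],
})

# Types missing from the mapping become real NaN values.
toy_vars['python_data_type'] = toy_vars['data_type_xsd'].map(XSD_TO_PYTHON)


def columns_of_type(variables, xsd_type):
    """Return the variable names whose data_type_xsd equals xsd_type."""
    mask = variables['data_type_xsd'] == xsd_type
    return variables.loc[mask, 'variable_name_new'].tolist()


print(toy_vars)
print(columns_of_type(toy_vars, 'YearType'))
```

With this pattern, *df[columns_of_type(new_variables_df, 'YearType')]* would select the same columns as the list comprehension, independent of the order of *type_list*. Using *.map* also leaves unassigned variables as true missing values rather than the string 'nan' that appears in the *value_counts( )* output of the *np.where* approach.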

#### *DateType*

In [100]:
print(type_list[0])
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[0]]['variable_name_new'].tolist()]].dtypes)
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[0]]['variable_name_new'].tolist()]].sample(5)

DateType
F9_00_HD_SIGNING_OFFICER_SIGNTR    object
F9_00_HD_TAX_PER_END               object
dtype: object


Unnamed: 0,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_TAX_PER_END
1192110,2017-08-29,2015-12-31
1560724,2018-11-09,2018-06-30
95159,2011-05-02,2010-12-31
979647,2016-05-09,2015-12-31
280115,2012-12-18,2012-08-31
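
Both DateType columns are stored as 'YYYY-MM-DD' strings with dtype *object*. A minimal sketch of how they could be converted to datetimes is shown below on toy values. The choice of *errors='coerce'* (turning unparseable entries into NaT) is my assumption for illustration, not necessarily what the conversion step in this notebook uses.

```python
import pandas as pd

# Toy values in the same format as the sampled rows; one date is missing.
toy = pd.DataFrame({
    'F9_00_HD_SIGNING_OFFICER_SIGNTR': ['2017-08-29', None, '2011-05-02'],
    'F9_00_HD_TAX_PER_END': ['2015-12-31', '2018-06-30', '2010-12-31'],
}, dtype=object)

date_vars = ['F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_TAX_PER_END']
for col in date_vars:
    # An explicit format avoids per-value guessing; bad values become NaT.
    toy[col] = pd.to_datetime(toy[col], format='%Y-%m-%d', errors='coerce')

print(toy.dtypes)
print(toy['F9_00_HD_TAX_PER_END'].dt.year.tolist())
```

Once converted, the *.dt* accessor gives direct access to year, month, and day, which makes it easy to cross-check against *fiscal_year* and *TaxPeriod*.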


In [42]:
# Note: mixing a string with np.nan in np.where yields the string 'nan', not a true missing value
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[0], 'string', np.nan)
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[0]]

nan       191
string      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
15,F9_00_HD_SIGNING_OFFICER_SIGNTR,"[BusinessOfficerGrp, Officer]","[DateSigned, SignatureDt]",DateType,2,string
18,F9_00_HD_TAX_PER_END,"[TaxPeriodEndDt, TaxPeriodEndDate]",[nan],DateType,2,string


#### *StateType*
Note that some values of *F9_06_PC_STATES_WHERE_RET_FILED* are lists. We likely won't use this variable yet, though it could be useful for measuring the *geographic scope* of an organization.
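
If this variable is used later as a rough proxy for geographic scope, one option is to count the states in each cell. The sketch below is an assumption about how that could be done; the helper name *count_states* is hypothetical, and it treats a missing value as zero states.

```python
import numpy as np
import pandas as pd


def count_states(value):
    """Count states in a cell that may hold a list, a single string, or NaN."""
    if isinstance(value, (list, tuple)):
        return len(value)
    if isinstance(value, str):
        return 1
    return 0  # missing value


# Toy cells covering the three cases seen in the samples.
states = pd.Series([['PA', 'NJ', 'DE'], 'WV', np.nan], dtype=object)
n_states = states.apply(count_states)

print(n_states.tolist())  # [3, 1, 0]
```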

In [59]:
print(type_list[1])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[1]]['variable_name_new'].tolist()]].sample(5)

StateType


Unnamed: 0,F9_00_HD_FILER_STATE_US,F9_00_HD_STATE_OF_DOMICILE,F9_06_PC_STATES_WHERE_RET_FILED
1197674,OH,OH,
36753,WV,WV,WV
782511,FL,FL,FL
831602,MO,,"[AK, AZ, AR, CO, CT, FL, GA, IL, KY, LA, ME, MD, MN, MS, NH, NJ, NY, NC, ND, OH, OK, PA, SC, TN, UT, WA, WV]"
867315,SC,SC,


In [46]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[1]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[1], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[1]]

F9_00_HD_FILER_STATE_US            object
F9_00_HD_STATE_OF_DOMICILE         object
F9_06_PC_STATES_WHERE_RET_FILED    object
dtype: object 

nan       188
string      5
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
7,F9_00_HD_FILER_STATE_US,[Filer],[USAddress],StateType,1,string
17,F9_00_HD_STATE_OF_DOMICILE,"[StateLegalDomicile, LegalDomicileStateCd]",[nan],StateType,2,string
98,F9_06_PC_STATES_WHERE_RET_FILED,"[StatesWhereCopyOfReturnIsFldCd, StatesWhereCopyOfReturnIsFiled]",[nan],StateType,2,string


#### *PersonNameType*

In [101]:
print(type_list[2])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[2]]['variable_name_new'].tolist()]].sample(5)

PersonNameType


Unnamed: 0,F9_00_HD_PRIN_OFF_NAME
1548814,
1369759,
1493961,
923507,ANDREW C MELZER
688857,NANCY GALLAGHER


In [102]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[2]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[2], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[2]]

F9_00_HD_PRIN_OFF_NAME    object
dtype: object 

nan       187
string      6
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
14,F9_00_HD_PRIN_OFF_NAME,"[NameOfPrincipalOfficerPerson, PrincipalOfficerNm]",[nan],PersonNameType,2,string


#### *YearType*

- Note new variable type: "Nullable integer data type" - https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
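
To make the nullable integer type concrete, here is a small sketch on toy values resembling *F9_00_HD_YEAR_FORMED* (digit strings, some with a trailing '.0', plus missing values). The two-step route through *pd.to_numeric( )* is my assumption of a workable path, not a record of what this notebook ends up doing.

```python
import numpy as np
import pandas as pd

# Toy values: string years, one written as a float string, one missing.
year_formed = pd.Series(['1990', np.nan, '1997.0'], dtype=object)

# Step 1: strings -> numbers (float64, because of the missing value).
as_float = pd.to_numeric(year_formed)

# Step 2: float64 -> nullable integer; NaN becomes <NA>.
as_nullable = as_float.astype('Int64')

print(as_float.dtype)      # float64
print(as_nullable.dtype)   # Int64
print(as_nullable.tolist())
```

The cast in step 2 only succeeds when every non-missing value is a whole number, so a column with fractional values would raise an error rather than be silently truncated.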

In [107]:
print(type_list[3])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[3]]['variable_name_new'].tolist()]].sample(5)

YearType


Unnamed: 0,F9_00_HD_TAX_YEAR,F9_00_HD_YEAR_FORMED
1049199,2015,1997.0
1493856,2017,1973.0
1321614,2016,1988.0
229983,2011,
618933,2012,


In [116]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[3]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[3], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[3]]

F9_00_HD_TAX_YEAR       object
F9_00_HD_YEAR_FORMED    object
dtype: object 

nan       185
string      6
Int64       2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
19,F9_00_HD_TAX_YEAR,"[TaxYear, TaxYr]",[nan],YearType,2,Int64
27,F9_00_HD_YEAR_FORMED,"[FormationYr, YearFormation]",[nan],YearType,2,Int64


In [114]:
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[3]]['variable_name_new'].tolist()]].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_TAX_YEAR,1895016,11,2017,252085
F9_00_HD_YEAR_FORMED,1747938,323,2000,41142


In [70]:
for index, row in df[24:26].iterrows():
    print(row['F9_00_HD_YEAR_FORMED'], type(row['F9_00_HD_YEAR_FORMED']))

1990 <class 'str'>
nan <class 'float'>


#### *BooleanType*

In [123]:
print(type_list[4])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[4]]['variable_name_new'].tolist()]].sample(5)

BooleanType


Unnamed: 0,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED
817136,0,,0,0,0,1,,0,0,0,0,0,0,0,1,0,0,,0,0,0,1,1,,0,0,,0,0,0,0,1,1.0,,0.0,0
1198544,0,,0,1,0,1,,0,0,0,0,0,0,0,0,1,0,,0,0,0,1,1,,0,0,,0,0,0,0,0,,,0.0,0
1134253,0,,0,0,0,1,,1,0,0,0,0,0,0,0,1,0,,0,0,0,1,1,,0,1,,0,0,1,0,0,,,0.0,0
663129,0,,1,0,0,0,,0,0,0,0,0,0,0,0,0,0,,1,0,1,1,1,,0,0,1.0,0,0,0,0,0,1.0,,0.0,1
93124,0,,0,0,0,1,,0,0,0,0,0,0,0,0,0,0,,0,0,0,1,1,,0,0,,0,0,0,0,0,,,,0


In [125]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[4]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[4], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[4]]

F9_00_HD_GROUP_RETURN                object
F9_00_HD_INCLUDES_SUBORD_ORGS        object
F9_04_PC_FR_EVENT_INC_GT_15K         object
F9_04_PC_GAMING_INC_GT_15K           object
F9_04_PC_PROF_FR_EXP_GT_15K          object
F9_06_PC_990_PROVIDED_GOV_BODY       object
F9_06_PC_ANNUAL_DISC_COVRD_PERS      object
F9_06_PC_CEO_COMPENSTN_PROCESS       object
F9_06_PC_CHANGES_ORGANIZING_DOCS     object
F9_06_PC_CONFLICT_OF_INTEREST        object
F9_06_PC_DECISIONS_SUBJ_APPROVAL     object
F9_06_PC_DELEGATION_MGT_DUTIES       object
F9_06_PC_DELEGATION_OF_MGT           object
F9_06_PC_DOCUMENT_RET_POLICY         object
F9_06_PC_ELECTION_BOARD_MEMBERS      object
F9_06_PC_FAMILY_OR_BUSINESS_REL      object
F9_06_PC_JOINT_VENTURE_INVESTMNT     object
F9_06_PC_JOINT_VENTURE_POLICY        object
F9_06_PC_LOCAL_CHAPTERS              object
F9_06_PC_MATERIAL_DIVERSION          object
F9_06_PC_MEMBERS_OR_STOCKHOLDERS     object
F9_06_PC_MINUTES_COMMITTEES          object
F9_06_PC_MINUTES_GOVERNING_BODY 

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
11,F9_00_HD_GROUP_RETURN,"[GroupReturnForAffiliatesInd, GroupReturnForAffiliates]",[nan],BooleanType,2,Int64
12,F9_00_HD_INCLUDES_SUBORD_ORGS,"[AllAffiliatesIncluded, AllAffiliatesIncludedInd]",[nan],BooleanType,2,Int64
67,F9_04_PC_FR_EVENT_INC_GT_15K,"[FundraisingActivities, FundraisingActivitiesInd]",[nan],BooleanType,2,Int64
68,F9_04_PC_GAMING_INC_GT_15K,"[GamingActivitiesInd, Gaming]",[nan],BooleanType,2,Int64
69,F9_04_PC_PROF_FR_EXP_GT_15K,"[ProfessionalFundraising, ProfessionalFundraisingInd]",[nan],BooleanType,2,Int64
70,F9_06_PC_990_PROVIDED_GOV_BODY,"[Form990ProvidedToGoverningBody, Form990ProvidedToGvrnBodyInd]",[nan],BooleanType,2,Int64
71,F9_06_PC_ANNUAL_DISC_COVRD_PERS,"[AnnualDisclosureCoveredPersons, AnnualDisclosureCoveredPrsnInd]",[nan],BooleanType,2,Int64
72,F9_06_PC_CEO_COMPENSTN_PROCESS,"[CompensationProcessCEOInd, CompensationProcessCEO]",[nan],BooleanType,2,Int64
73,F9_06_PC_CHANGES_ORGANIZING_DOCS,"[ChangeToOrgDocumentsInd, ChangesToOrganizingDocs]",[nan],BooleanType,2,Int64
74,F9_06_PC_CONFLICT_OF_INTEREST,"[ConflictOfInterestPolicyInd, ConflictOfInterestPolicy]",[nan],BooleanType,2,Int64


#### *CountType*

In [126]:
print(type_list[5])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[5]]['variable_name_new'].tolist()]].sample(5)

CountType


Unnamed: 0,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K
1159529,13,14,0.0,1
1034561,22,22,,1
776587,5,5,1.0,1
1384081,5,6,,0
631935,82,82,0.0,0


In [127]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[5]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[5], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[5]]

F9_06_PC_NUM_IND_VOTING_MEMBERS     object
F9_06_PC_NUM_VOTING_GOV_MEMBERS     object
F9_07_PC_NUM_CONTRCTRS_GRTR_100K    object
F9_07_PC_NUM_INDS_GREATER_100K      object
dtype: object 

nan       145
Int64      42
string      6
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
91,F9_06_PC_NUM_IND_VOTING_MEMBERS,"[NumberIndependentVotingMembers, IndependentVotingMemberCnt]",[nan],CountType,2,Int64
92,F9_06_PC_NUM_VOTING_GOV_MEMBERS,"[GoverningBodyVotingMembersCnt, NbrVotingGoverningBodyMembers]",[nan],CountType,2,Int64
103,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,"[CntrctRcvdGreaterThan100KCnt, NumberOfContractorsGT100K]",[nan],CountType,2,Int64
104,F9_07_PC_NUM_INDS_GREATER_100K,"[NumberIndividualsGT100K, IndivRcvdGreaterThan100KCnt]",[nan],CountType,2,Int64


#### *USAmountType*

In [128]:
print(type_list[6])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[6]]['variable_name_new'].tolist()]].sample(5)

USAmountType


Unnamed: 0,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN
970155,0.0,216433,238688.0,0.0,14.0,69483,241825.0,0.0,0,0.0,32503.0,-35774,29380.0,70458,241825.0,8944,36749,271205.0,0,0.0,0,0,14,33709,281754,0,29533,0,0.0,94406,281754,24923.0,245980,0,0,0.0,173224.0,,43209.0,,,,,,,,29533.0,,,29533.0,245980,,,,,3100.0,,,,,,,51675,3784.0,3784.0,,8600.0,,,,,,70458,,,,-35774,
563963,0.0,540,855.0,0.0,1.0,-80548,174608.0,13459.0,0,0.0,424180.0,18682,-10130.0,22407,448625.0,0,23117,438495.0,0,0.0,0,0,15,-710,151748,71657,372191,273973,274017.0,7615,425721,88163.0,444403,0,31500,0.0,540.0,,,5646.0,,15007.0,,,,,372191.0,,62296.0,372191.0,444403,,3034.0,27307.0,30341.0,815.0,,,5045.0,,,46685.0,9318,84511.0,83792.0,,,,,4406.0,,5986.0,22407,,,61156.0,18682,
110520,0.0,10558014,10480661.0,0.0,0.0,7028,10473661.0,28.0,0,0.0,0.0,0,7028.0,7028,10473661.0,0,0,10480689.0,0,0.0,0,0,0,7028,10558014,0,0,0,0.0,7028,10558014,0.0,10558014,269318,0,1321208.0,,,,,,,,,,,,10558014.0,,,10558014,,,,,,,,,,,,7028,,,,,,,,,,7028,,,,0,
1753390,384278.0,3522869,2602938.0,,100831.0,8746946,2090969.0,1149017.0,0,,637593.0,910291,970884.0,12337837,3519495.0,487145,2954668,4490379.0,0,,0,0,93677,9383169,3127831,990391,557283,1126098,1044248.0,11893999,4253929,3147053.0,5164220,16205,122674,,1434784.0,120589.0,,530741.0,120589.0,1521132.0,,,,,557283.0,,,557283.0,5164220,,138880.0,,138880.0,22067.0,,3665.0,871.0,26018.0,,491478.0,3198270,337457.0,234488.0,,1840597.0,,4826487.0,,,,12337837,,,,910291,-274068.0
978292,,252870,,,,3027,,,0,,,223010,,226037,,0,0,,0,0.0,0,0,0,226037,29860,0,0,0,,3027,29860,,252870,0,0,0.0,252870.0,,,,,,,,,220000.0,,,,,252870,,,,,,,,,,,,6037,220000.0,,,,,,,,,226037,,,,223010,


In [129]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[6]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[6], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[6]]

F9_01_PC_BEN_PAID_MEMB_PRIOR       object
F9_01_PC_CONTR_GRANTS_CURR         object
F9_01_PC_CONTR_GRANTS_PRIOR        object
F9_01_PC_GRANTS_PRIOR              object
F9_01_PC_INVEST_INCOME_PRIOR       object
                                    ...  
F9_11_PC_RECNCLTN_DONATED_SVCES    object
F9_11_PC_RECNCLTN_INVSTMNT_EXP     object
F9_11_PC_RECNCLTN_PRIOR_PER_ADJ    object
F9_11_PC_RECNCLTN_REV_LESS_EXP     object
F9_11_PC_RECNCLTN_UNRLZD_GAIN      object
Length: 78, dtype: object 

Int64     120
nan        67
string      6
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
28,F9_01_PC_BEN_PAID_MEMB_PRIOR,"[PYBenefitsPaidToMembersAmt, BenefitsPaidToMembersPriorYear]",[nan],USAmountType,2,Int64
29,F9_01_PC_CONTR_GRANTS_CURR,"[ContributionsGrantsCurrentYear, CYContributionsGrantsAmt]",[nan],USAmountType,2,Int64
30,F9_01_PC_CONTR_GRANTS_PRIOR,"[ContributionsGrantsPriorYear, PYContributionsGrantsAmt]",[nan],USAmountType,2,Int64
31,F9_01_PC_GRANTS_PRIOR,"[PYGrantsAndSimilarPaidAmt, GrantsAndSimilarAmntsPriorYear]",[nan],USAmountType,2,Int64
33,F9_01_PC_INVEST_INCOME_PRIOR,"[PYInvestmentIncomeAmt, InvestmentIncomePriorYear]",[nan],USAmountType,2,Int64
...,...,...,...,...,...,...
179,F9_11_PC_RECNCLTN_DONATED_SVCES,"[ReconcilationDonatedServices, DonatedServicesAndUseFcltsAmt]",[nan],USAmountType,2,Int64
180,F9_11_PC_RECNCLTN_INVSTMNT_EXP,"[InvestmentExpenseAmt, ReconcilationInvestExpenses]",[nan],USAmountType,2,Int64
181,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,"[PriorPeriodAdjustmentsAmt, ReconcilationPriorAdjustment]",[nan],USAmountType,2,Int64
182,F9_11_PC_RECNCLTN_REV_LESS_EXP,"[ReconcilationRevenueExpenses, ReconcilationRevenueExpnssAmt]",[nan],USAmountType,2,Int64
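
The sample above mixes values such as `216433` and `238688.0` with blanks in `object` columns, which is why the label 'Int64' does not by itself fix the final dtype. A minimal sketch of how `pd.to_numeric` behaves on such columns, assuming (this is not confirmed by the output above) that the raw values are stored as strings with `None` for blanks. The two toy Series are hypothetical.

```python
import pandas as pd

# Hypothetical amount columns: one fully populated, one with a missing value.
complete = pd.Series(['216433', '540', '10558014'], dtype=object)
with_gaps = pd.Series(['238688', None, '855'], dtype=object)

# pd.to_numeric picks the narrowest NumPy dtype that can hold every value.
complete_num = pd.to_numeric(complete)    # no missing values -> int64
gaps_num = pd.to_numeric(with_gaps)       # missing value forces float64 (NaN)

print(complete_num.dtype, gaps_num.dtype)
```

With default settings the result is a NumPy `int64` or `float64` column, not the nullable `Int64` extension dtype, so columns containing missing values come back as floats.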


#### *YearMonthType*

In [130]:
print(type_list[7])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[7]]['variable_name_new'].tolist()]].sample(5)

YearMonthType


Unnamed: 0,TaxPeriod
1706135,201806
1239985,201612
462196,201306
1716414,201806
1207458,201612


In [131]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[7]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[7], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[7]]

TaxPeriod    object
dtype: object 

Int64     120
nan        66
string      7
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
192,TaxPeriod,[TaxPeriod],[nan],YearMonthType,1,string
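
*TaxPeriod* is kept as a 'string' here. If the year or month is ever needed as a number, one possible approach is to parse the `YYYYMM` text on demand. The sketch below uses a toy Series built from the sample values above; the names `period_start`, `end_year`, and `end_month` are hypothetical and are not variables in this notebook.

```python
import pandas as pd

# Toy values copied from the TaxPeriod sample shown above.
tax_period = pd.Series(['201806', '201612', '201306'])

# Parse YYYYMM; the resulting timestamps fall on the first day of that month.
period_start = pd.to_datetime(tax_period, format='%Y%m')

end_year = period_start.dt.year     # calendar year in which the tax period ends
end_month = period_start.dt.month   # month in which the tax period ends

print(end_year.tolist(), end_month.tolist())
```

Leaving the stored column as a string avoids losing the fixed six-character layout, while the parsed values can be derived whenever they are needed.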


#### *ShortExplanationType*

In [132]:
print(type_list[8])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[8]]['variable_name_new'].tolist()]].sample(5)

ShortExplanationType


Unnamed: 0,F9_01_PZ_ORGANIZATIONAL_MISSION
995181,"THE ORGANIZATION COLLECTS MONEY FROM THE GENERAL PUBLIC FOR THE BENEFIT OF CHILDREN. PROGRAMS IS PROVIDED INCLUDE ""SHOP WITH A COP"" ""BACK TO SCHOOL SHOE DRIVE"" A NEEDY FOOD PROGRAM FOR FAMILIES AT CHRISTMAS TIME, A WIDOW ASSITANCE FUND FOR FALLEN..."
60361,COMMUNITY FACILITIES AND SERVICE.
331621,EDUCATION FOR PRE-KINDERGARTEN THROUGH 8TH GRADE.
592434,BUILDING MATERIAL REUSE/RECYCLING
87528,"TO PROVIDE AFFORDABLE HOUSING TO SENIOR CITIZENS IN THE MARION COUNTY AREA, SOUTH CAROLINA."


In [133]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[8]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[8], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[8]]

F9_01_PZ_ORGANIZATIONAL_MISSION    object
dtype: object 

Int64     120
nan        65
string      8
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
57,F9_01_PZ_ORGANIZATIONAL_MISSION,"[ActivityOrMissionDescription, ActivityOrMissionDesc]",[nan],ShortExplanationType,2,string


#### *TimestampType*

In [134]:
print(type_list[9])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[9]]['variable_name_new'].tolist()]].sample(5)

TimestampType


Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_TIME_STAMP
1265768,2017-02-10 21:41:12Z,2017-11-15T04:14:44-08:00
1562847,2019-02-21 02:37:17Z,2019-02-20T12:09:37-06:00
1303547,2018-03-14 21:41:22Z,2018-01-09T11:36:43-08:00
1235249,2017-02-10 21:41:12Z,2017-11-15T08:47:04-08:00
786862,2016-02-25 16:41:14Z,2015-11-16T08:10:02-06:00


In [135]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[9]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[9], 'DateTime', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[9]]

F9_00_HD_BUILD_TIME_STAMP    object
F9_00_HD_TIME_STAMP          object
dtype: object 

Int64       120
nan          63
string        8
DateTime      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],[nan],TimestampType,1,DateTime
20,F9_00_HD_TIME_STAMP,"[Timestamp, ReturnTs]",[nan],TimestampType,2,DateTime
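
The two timestamp columns use different layouts: *F9_00_HD_BUILD_TIME_STAMP* ends in `Z`, while *F9_00_HD_TIME_STAMP* carries a local UTC offset that varies by row. A hedged sketch of one way to convert both, assuming that normalizing everything to UTC is acceptable. The toy frame `stamps` reuses two sample rows from above and is not the conversion code used later in this notebook.

```python
import pandas as pd

# Two sample rows copied from the output above.
stamps = pd.DataFrame({
    'F9_00_HD_BUILD_TIME_STAMP': ['2017-02-10 21:41:12Z',
                                  '2019-02-21 02:37:17Z'],
    'F9_00_HD_TIME_STAMP': ['2017-11-15T04:14:44-08:00',
                            '2019-02-20T12:09:37-06:00'],
})

converted = pd.DataFrame(index=stamps.index)
for column in stamps.columns:
    # Convert one column at a time so each keeps a single, consistent layout.
    # utc=True shifts rows with different offsets onto a common UTC timeline.
    converted[column] = pd.to_datetime(stamps[column], utc=True)

print(converted.dtypes)
```

Converting with `utc=True` discards the original local offset, so the raw strings should be kept if the filer's local time matters.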


#### *CountryType*

In [153]:
print(type_list[10])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[10]]['variable_name_new'].tolist()]].sample(5)

CountryType


Unnamed: 0,F9_00_HD_CTRY_OF_DOMICILE
429340,
37455,
518468,
1495801,
174259,


In [154]:
df[df['F9_00_HD_CTRY_OF_DOMICILE'].notnull()][['F9_00_HD_CTRY_OF_DOMICILE']].sample(5)

Unnamed: 0,F9_00_HD_CTRY_OF_DOMICILE
261664,UK
418119,CA
1244229,SZ
1263919,CA
1249424,GM


In [155]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[10]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[10], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[10]]

F9_00_HD_CTRY_OF_DOMICILE    object
dtype: object 

Int64       120
nan          62
string        9
DateTime      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
3,F9_00_HD_CTRY_OF_DOMICILE,"[LegalDomicileCountryCd, CountryLegalDomicile]",[nan],CountryType,2,string


#### *TextType*

In [156]:
print(type_list[11])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[11]]['variable_name_new'].tolist()]].sample(5)

TextType


Unnamed: 0,F9_00_HD_SPECIAL_CONDITION_DESC
1061998,
23688,
334998,
120733,
317463,


In [159]:
df[df['F9_00_HD_SPECIAL_CONDITION_DESC'].notnull()][['F9_00_HD_SPECIAL_CONDITION_DESC']].sample(5)

Unnamed: 0,F9_00_HD_SPECIAL_CONDITION_DESC
1265574,HURRICANE IRMA RELIEF IRC 2017150
496562,EXTENSION GRANTED TO 81514
536514,EXTENSIONS ATTACHED
1472206,EXTENSION GRANTED UNTIL MAY 15 2018
1261443,HURRICANE HARVEY


In [160]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[11]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[11], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[11]]

F9_00_HD_SPECIAL_CONDITION_DESC    object
dtype: object 

Int64       120
nan          61
string       10
DateTime      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
16,F9_00_HD_SPECIAL_CONDITION_DESC,"[SpecialConditionDesc, SpecialConditionDescription]",[nan],TextType,2,string


#### *LineExplanationType*

In [161]:
print(type_list[12])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[12]]['variable_name_new'].tolist()]].sample(5)

LineExplanationType


Unnamed: 0,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_WEBSITE
53048,,
81450,,http://pbcers.org
277734,,WWW.VISITPHOENIX.COM
285814,,www.alamancedisputesettlement.org
527341,,WWW.WWAMH.ORG


In [164]:
df[df['F9_00_HD_TYPE_ORG_OTHER_DESC'].notnull()][['F9_00_HD_TYPE_ORG_OTHER_DESC']].sample(5)

Unnamed: 0,F9_00_HD_TYPE_ORG_OTHER_DESC
63564,CLUB
972411,A POLITICAL SUBDIVISION
1714084,PUBLIC CHARI
241433,RELIGIOUS ORGANIZATION
984103,PUBLIC CHARITY


In [165]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[12]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[12], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[12]]

F9_00_HD_TYPE_ORG_OTHER_DESC    object
F9_00_HD_WEBSITE                object
dtype: object 

Int64       120
nan          59
string       12
DateTime      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
24,F9_00_HD_TYPE_ORG_OTHER_DESC,"[OtherOrganizationDsc, TypeOfOrgOtherDescription]",[nan],LineExplanationType,2,string
26,F9_00_HD_WEBSITE,"[WebSite, WebsiteAddressTxt]",[nan],LineExplanationType,2,string


#### *USAmountNNType*

In [166]:
print(type_list[13])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[13]]['variable_name_new'].tolist()]].sample(5)

USAmountNNType


Unnamed: 0,F9_00_HD_GROSS_RCPT,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY
418667,313111,,,,285561.0,,,,,,2475.0,2475.0,4950.0,,23340.0,42370.0,65710.0,,3068.0,5574.0,8642.0,,,,,255284,0,67840,187444,,129961.0,
73222,552401,,,,,,,,,,,,,,,,,,,,,,,,,84868,0,6779,78089,,,107931.0
666630,279761,,,,153803.0,,,,,,,3772.0,3772.0,3444.0,13556.0,66262.0,83262.0,613.0,1865.0,16537.0,19015.0,,,,,388927,9326,66857,312744,,71531.0,
160183,98033,,,,,,,,,,3848.0,,3848.0,,,,,,3587.0,,3587.0,,,,,85004,0,85004,0,,26323.0,43806.0
684380,181597,,,,181593.0,,,,,,,,,,,,,,,,,,,,,169609,0,0,169609,,9144.0,


In [167]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[13]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[13], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[13]]

F9_00_HD_GROSS_RCPT                 object
F9_08_PC_COST_OF_GOODS_SOLD         object
F9_08_PC_GROSS_SALES_INVENTORY      object
F9_08_PC_MEMBERSHIP_DUES            object
F9_08_PC_TOTAL_CONTRIBUTIONS        object
F9_09_PC_COMP_DISQUAL_FUNDRAISE     object
F9_09_PC_COMP_DISQUAL_MGMT          object
F9_09_PC_COMP_DISQUAL_PROG_SVCE     object
F9_09_PC_COMP_DISQUAL_TOTAL         object
F9_09_PC_OTHER_EMP_BEN_FUNDRAISE    object
F9_09_PC_OTHER_EMP_BEN_MGMT         object
F9_09_PC_OTHER_EMP_BEN_PROG_SVCE    object
F9_09_PC_OTHER_EMP_BEN_TOTAL        object
F9_09_PC_OTHER_SALARY_FUNDRAISE     object
F9_09_PC_OTHER_SALARY_MGMT          object
F9_09_PC_OTHER_SALARY_PROG_SVCE     object
F9_09_PC_OTHER_SALARY_TOTAL         object
F9_09_PC_PAYROLL_TAX_FUNDRAISE      object
F9_09_PC_PAYROLL_TAX_MGMT           object
F9_09_PC_PAYROLL_TAX_PROG_SVCE      object
F9_09_PC_PAYROLL_TAX_TOTAL          object
F9_09_PC_PENSION_CONT_FUNDRAISE     object
F9_09_PC_PENSION_CONT_MGMT          object
F9_09_PC_PE

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
10,F9_00_HD_GROSS_RCPT,"[GrossReceipts, GrossReceiptsAmt]",[nan],USAmountNNType,2,Int64
111,F9_08_PC_COST_OF_GOODS_SOLD,"[CostOfGoodsSold, CostOfGoodsSoldAmt]",[nan],USAmountNNType,2,Int64
119,F9_08_PC_GROSS_SALES_INVENTORY,"[GrossSalesOfInventoryAmt, GrossSalesOfInventory]",[nan],USAmountNNType,2,Int64
120,F9_08_PC_MEMBERSHIP_DUES,"[MembershipDues, MembershipDuesAmt]",[nan],USAmountNNType,2,Int64
124,F9_08_PC_TOTAL_CONTRIBUTIONS,"[TotalContributions, TotalContributionsAmt]",[nan],USAmountNNType,2,Int64
128,F9_09_PC_COMP_DISQUAL_FUNDRAISE,"[CompDisqualPersons, CompDisqualPersonsGrp]","[Fundraising, FundraisingAmt]",USAmountNNType,2,Int64
129,F9_09_PC_COMP_DISQUAL_MGMT,"[CompDisqualPersons, CompDisqualPersonsGrp]","[ManagementAndGeneralAmt, ManagementAndGeneral]",USAmountNNType,2,Int64
130,F9_09_PC_COMP_DISQUAL_PROG_SVCE,"[CompDisqualPersons, CompDisqualPersonsGrp]","[ProgramServicesAmt, ProgramServices]",USAmountNNType,2,Int64
131,F9_09_PC_COMP_DISQUAL_TOTAL,"[CompDisqualPersons, CompDisqualPersonsGrp]","[Total, TotalAmt]",USAmountNNType,2,Int64
143,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,"[OtherEmployeeBenefitsGrp, OtherEmployeeBenefits]","[Fundraising, FundraisingAmt]",USAmountNNType,2,Int64


#### *StringType*

In [168]:
print(type_list[14])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[14]]['variable_name_new'].tolist()]].sample(5)

StringType


Unnamed: 0,F9_00_HD_GROSS_EXEMPT_NUM
415173,
744687,
1749553,
327414,
1647240,


In [172]:
df[df['F9_00_HD_GROSS_EXEMPT_NUM'].notnull()][['F9_00_HD_GROSS_EXEMPT_NUM']].sample(5)

Unnamed: 0,F9_00_HD_GROSS_EXEMPT_NUM
100667,928
1025474,646
1119044,964
67900,3099
143253,1017


In [173]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[14]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[14], 'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[14]]

F9_00_HD_GROSS_EXEMPT_NUM    object
dtype: object 

Int64       152
nan          26
string       13
DateTime      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
9,F9_00_HD_GROSS_EXEMPT_NUM,"[GroupExemptionNumber, GroupExemptionNum]",[nan],StringType,2,string


#### *CheckboxType*

In [175]:
print(type_list[15])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[15]]['variable_name_new'].tolist()]].sample(15)

CheckboxType


Unnamed: 0,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_INITIAL_RETURN,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_TRUST,F9_01_PC_TERMINATION_CONTRACTION,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER
677314,,,,5.0,,,,,,1.0,,,,1.0,,,,1.0,,,,Modified Cash
915150,,,,,1.0,,,,1.0,,,,,1.0,,,,,1.0,,1.0,
128745,,,,,1.0,,,,1.0,,,,,1.0,1.0,,,1.0,,1.0,,
1451284,,,,6.0,,,,,1.0,,,,,1.0,,,,1.0,,1.0,,
541087,,,,,1.0,,,,,,1.0,,,1.0,1.0,,,,1.0,,1.0,
720191,,,,5.0,,,,,1.0,,,,,1.0,,,1.0,1.0,,,1.0,
622708,,,,,1.0,,,,1.0,,,,,,,,1.0,1.0,,,1.0,
1011692,,,,,1.0,,,,1.0,,,,,1.0,,,1.0,1.0,,1.0,,
83472,,,,3.0,,,,,1.0,,,,,1.0,,,,1.0,,1.0,,
200442,,,,,1.0,,,,1.0,,,,,1.0,,,,1.0,,1.0,,


In [176]:
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[15]]['variable_name_new'].tolist()]].describe().T

Unnamed: 0,count,unique,top,freq
F9_00_HD_ADDR_CHANGE,74526,1,1,74526
F9_00_HD_AMENDED_RETURN,16642,1,1,16642
F9_00_HD_EXEMPT_STATUS_4847A1,1514,1,1,1514
F9_00_HD_EXEMPT_STATUS_501C,486371,24,6,131940
F9_00_HD_EXEMPT_STATUS_501C3,1407131,1,1,1407131
F9_00_HD_FINAL_RETURN,10123,1,1,10123
F9_00_HD_INITIAL_RETURN,18113,1,1,18113
F9_00_HD_TYPE_ORG_ASSOCIATION,83569,1,1,83569
F9_00_HD_TYPE_ORG_CORP,1666981,1,1,1666981
F9_00_HD_TYPE_ORG_OTHER,45946,1,1,45946


##### Look at all CheckboxType variables that have more than one value
- Two variables are inspected here: *F9_00_HD_EXEMPT_STATUS_501C* and *F9_12_PC_ACCTG_METHOD_OTHER*

In [177]:
df['F9_00_HD_EXEMPT_STATUS_501C'].value_counts()

6     131940
4      67297
5      57680
7      53636
9      32535
3      28339
8      21452
19     20858
12     19202
14     18970
2      12846
13     10987
10      6182
25      2982
15       803
17       363
26        71
16        63
29        51
11        45
18        22
27        22
23        18
20         7
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64

In [178]:
df['F9_12_PC_ACCTG_METHOD_OTHER'].value_counts()[:10]

MODIFIED CASH          15005
X                       4126
Modified Cash           3656
MODIFIED CASH BASIS     2062
modified cash           1109
HYBRID                  1090
Modified cash            816
Modified Cash Basis      789
MODIFIED ACCRUAL         736
MODIFIED CAS             674
Name: F9_12_PC_ACCTG_METHOD_OTHER, dtype: int64

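*F9_12_PC_ACCTG_METHOD_OTHER* holds free-text descriptions (e.g., 'MODIFIED CASH'), so it has to be 'string'. The remaining CheckboxType variables, including the numeric codes in *F9_00_HD_EXEMPT_STATUS_501C*, can be 'Int64'.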
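
The same check can be done programmatically rather than by reading `describe()` output. A minimal sketch, assuming a toy frame `checkbox_df` with shortened, hypothetical column names and values stored as strings: `nunique()` flags columns with more than one distinct value, and a coerced `pd.to_numeric` flags those that contain non-numeric text.

```python
import pandas as pd

# Hypothetical miniature of the CheckboxType columns.
checkbox_df = pd.DataFrame({
    'ADDR_CHANGE': ['1', None, '1', None],
    'EXEMPT_STATUS_501C': ['6', '4', None, '5'],
    'ACCTG_METHOD_OTHER': [None, 'MODIFIED CASH', 'X', None],
}, dtype=object)

# Columns with more than one distinct non-missing value.
unique_counts = checkbox_df.nunique()
multi_valued = unique_counts[unique_counts > 1].index.tolist()

# Of those, columns where some non-missing value fails numeric conversion.
non_numeric = [
    col for col in multi_valued
    if pd.to_numeric(checkbox_df[col], errors='coerce').notna().sum()
    < checkbox_df[col].notna().sum()
]

print(multi_valued, non_numeric)
```

Columns that land in `non_numeric` are candidates for 'string'; the rest can be labeled 'Int64'.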

In [179]:
new_variables_df[:1]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",[nan],CheckboxType,2,


In [180]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[15]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[15], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['variable_name_new']=='F9_12_PC_ACCTG_METHOD_OTHER', 
                                                'string', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[15]]

F9_00_HD_ADDR_CHANGE                object
F9_00_HD_AMENDED_RETURN             object
F9_00_HD_EXEMPT_STATUS_4847A1       object
F9_00_HD_EXEMPT_STATUS_501C         object
F9_00_HD_EXEMPT_STATUS_501C3        object
F9_00_HD_FINAL_RETURN               object
F9_00_HD_INITIAL_RETURN             object
F9_00_HD_TYPE_ORG_ASSOCIATION       object
F9_00_HD_TYPE_ORG_CORP              object
F9_00_HD_TYPE_ORG_OTHER             object
F9_00_HD_TYPE_ORG_TRUST             object
F9_01_PC_TERMINATION_CONTRACTION    object
F9_06_PC_FORM_AVAIL_OWN_WEBSITE     object
F9_06_PC_FORM_UPON_REQUEST          object
F9_06_PC_OTHER_WEBSITE              object
F9_06_PC_OWN_WEBSITE                object
F9_07_PC_NO_LISTED_PERS_COMPENSD    object
F9_10_PC_ORG_FOLLOWS_SFAS117        object
F9_10_PC_ORG_NOT_FOLLOW_SFAS117     object
F9_12_PC_ACCTG_METHOD_ACCRUAL       object
F9_12_PC_ACCTG_METHOD_CASH          object
F9_12_PC_ACCTG_METHOD_OTHER         object
dtype: object 

Int64       174
string       13
nan   

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",[nan],CheckboxType,2,Int64
1,F9_00_HD_AMENDED_RETURN,"[AmendedReturn, AmendedReturnInd]",[nan],CheckboxType,2,Int64
4,F9_00_HD_EXEMPT_STATUS_4847A1,"[Organization4947a1NotPFInd, Organization4947a1]",[nan],CheckboxType,2,Int64
5,F9_00_HD_EXEMPT_STATUS_501C,"[Organization501c, Organization501cInd]",[nan],CheckboxType,2,Int64
6,F9_00_HD_EXEMPT_STATUS_501C3,"[Organization501c3Ind, Organization501c3]",[nan],CheckboxType,2,Int64
8,F9_00_HD_FINAL_RETURN,"[FinalReturnInd, TerminatedReturn]",[nan],CheckboxType,2,Int64
13,F9_00_HD_INITIAL_RETURN,"[InitialReturnInd, InitialReturn]",[nan],CheckboxType,2,Int64
21,F9_00_HD_TYPE_ORG_ASSOCIATION,"[TypeOfOrganizationAssocInd, TypeOfOrganizationAssociation]",[nan],CheckboxType,2,Int64
22,F9_00_HD_TYPE_ORG_CORP,"[TypeOfOrganizationCorpInd, TypeOfOrganizationCorporation]",[nan],CheckboxType,2,Int64
23,F9_00_HD_TYPE_ORG_OTHER,"[TypeOfOrganizationOtherInd, TypeOfOrganizationOther]",[nan],CheckboxType,2,Int64


#### *IntegerNNType*

In [181]:
print(type_list[16])
df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[16]]['variable_name_new'].tolist()]].sample(5)

IntegerNNType


Unnamed: 0,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_VOTING_MEMB_GOV_BODY
986724,9,6,75.0,9
354281,11,5,0.0,11
1804675,10,6,40.0,10
56953,6,0,0.0,7
1786670,4,0,,4


In [182]:
print(df[[c for c in new_variables_df[new_variables_df['data_type_xsd']==type_list[16]]['variable_name_new'].tolist()]].dtypes, '\n')
new_variables_df['python_data_type'] = np.where(new_variables_df['data_type_xsd']==type_list[16], 'Int64', new_variables_df['python_data_type'])
print(new_variables_df['python_data_type'].value_counts(), '\n')
new_variables_df[new_variables_df['data_type_xsd']==type_list[16]]

F9_01_PC_INDEP_VOTING_MEMB       object
F9_01_PC_TOT_INDIV_EMPLOYED      object
F9_01_PC_TOT_INDIV_VOLUNTEERS    object
F9_01_PC_VOTING_MEMB_GOV_BODY    object
dtype: object 

Int64       177
string       14
DateTime      2
Name: python_data_type, dtype: int64 



Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
32,F9_01_PC_INDEP_VOTING_MEMB,"[NbrIndependentVotingMembers, VotingMembersIndependentCnt]",[nan],IntegerNNType,2,Int64
46,F9_01_PC_TOT_INDIV_EMPLOYED,"[TotalEmployeeCnt, TotalNbrEmployees]",[nan],IntegerNNType,2,Int64
47,F9_01_PC_TOT_INDIV_VOLUNTEERS,"[TotalNbrVolunteers, TotalVolunteersCnt]",[nan],IntegerNNType,2,Int64
52,F9_01_PC_VOTING_MEMB_GOV_BODY,"[NbrVotingMembersGoverningBody, VotingMembersGoverningBodyCnt]",[nan],IntegerNNType,2,Int64
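
Count columns such as *F9_01_PC_TOT_INDIV_VOLUNTEERS* display values like `75.0` next to blanks. If a true nullable integer column is wanted, one option is an explicit cast after numeric conversion. This is a sketch under assumptions: the helper `to_nullable_int` is hypothetical, and the toy inputs assume string values with `None` for blanks.

```python
import pandas as pd


def to_nullable_int(series):
    """Convert to pandas' nullable Int64 when every value is a whole number."""
    numeric = pd.to_numeric(series)
    non_missing = numeric.dropna()
    # Only cast when no fractional part would be lost.
    if (non_missing % 1 == 0).all():
        return numeric.astype('Int64')
    return numeric


volunteers = pd.Series(['75.0', '0.0', '40.0', None], dtype=object)
fractional = pd.Series(['1.5', '2'], dtype=object)

volunteers_int = to_nullable_int(volunteers)   # Int64 with one <NA>
fractional_kept = to_nullable_int(fractional)  # stays float64

print(volunteers_int.dtype, fractional_kept.dtype)
```

The whole-number guard keeps the cast from raising on columns that unexpectedly contain fractional amounts.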


#### Save *new_variables_df*

In [187]:
len(new_variables_df)

193

In [186]:
new_variables_df['python_data_type'].value_counts()

Int64       177
string       14
DateTime      2
Name: python_data_type, dtype: int64

In [188]:
new_variables_df.to_pickle('new_variables_df (with python_data_type).pkl')
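
Before saving, it can help to confirm that every variable has a label. The earlier `value_counts()` printouts list `nan` as a counted category, which hints that unlabeled entries may be the literal string 'nan' rather than true missing values; that is an inference from the output, not something verified here. The sketch below therefore checks for both. The toy frame `variables` and the set `allowed` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for new_variables_df with one string 'nan' and one real NaN.
variables = pd.DataFrame({
    'variable_name_new': ['A', 'B', 'C'],
    'python_data_type': pd.Series(['Int64', 'nan', np.nan], dtype=object),
})

labels = variables['python_data_type']

# Treat both real missing values and the text 'nan' as unlabeled.
unlabeled = labels.isna() | labels.astype(str).str.lower().eq('nan')

# Flag labels outside the three values used in this notebook.
allowed = {'Int64', 'string', 'DateTime'}
unexpected = ~unlabeled & ~labels.isin(allowed)

missing_names = variables.loc[unlabeled, 'variable_name_new'].tolist()
print(missing_names, int(unexpected.sum()))
```

An empty `missing_names` list and a zero count of unexpected labels would match the final tally of 177 'Int64', 14 'string', and 2 'DateTime' across 193 variables.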

In [7]:
#new_variables_df = pd.read_pickle('new_variables_df (with python_data_type).pkl')

In [6]:
pwd

'C:\\Users\\Gregory\\IRS 990 Control Variables'

In [8]:
len(new_variables_df)

193

In [217]:
new_variables_df['python_data_type'].value_counts()

Int64       177
string       14
DateTime      2
Name: python_data_type, dtype: int64

In [192]:
print(len(df))
print(len(df[df['501c3']==1]))

1895016
1435470


In [193]:
df[:1]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1,,,,,1,,,1473903,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0,1439340,1044925,638637,10,30447,1753405,243131,0,0,0,0,89152,193604,,2440859,881768,195892,0,0,450430,1075372,0,0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1,0,0,0,0,0,0,1439340,,,,,,,,,,,,,,,1439340,1000,,1473903,,,,,,,,,21675,,215,,,,,,,,,,,,,,,,,,,,1384751,195892,145115,1043744,,,,256845,86228,,1,,240077,,332660,270700,,,,2440859,,,,89152,,0,1,,,1,,0,1,1,PA


# 12/8/2020 -- SIDEBAR -- Find out how many of the sample EINs are in the e-file data
This section can be skipped or deleted in the next run.

In [194]:
eins = pd.read_excel('sample ein years.xlsx')
print(len(eins))
eins[:2]

3009


Unnamed: 0,ein,year
0,980391928,2009
1,131760110,2009


In [195]:
eins.dtypes

ein     int64
year    int64
dtype: object

In [196]:
df[:1]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1,,,,,1,,,1473903,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0,1439340,1044925,638637,10,30447,1753405,243131,0,0,0,0,89152,193604,,2440859,881768,195892,0,0,450430,1075372,0,0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1,0,0,0,0,0,0,1439340,,,,,,,,,,,,,,,1439340,1000,,1473903,,,,,,,,,21675,,215,,,,,,,,,,,,,,,,,,,,1384751,195892,145115,1043744,,,,256845,86228,,1,,240077,,332660,270700,,,,2440859,,,,89152,,0,1,,,1,,0,1,1,PA


In [197]:
df[['EIN']].dtypes

EIN    object
dtype: object
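
The two frames store the identifier differently: `eins['ein']` is int64, while `df['EIN']` is object (strings). Below is a minimal sketch, on toy stand-in frames rather than the real data, of why the keys need a common type before matching, and of two ways to align them. The names `sample` and `filings` are hypothetical stand-ins for `eins` and `df`.

```python
import pandas as pd

# Toy stand-ins: integer EINs (like `eins`) vs. string EINs (like `df`)
sample = pd.DataFrame({'ein': [43177990, 232705170]})
filings = pd.DataFrame({'EIN': ['043177990', '232705170', '204814407']})

# Strings never equal integers, so a direct comparison finds nothing
n_direct = int(filings['EIN'].isin(sample['ein']).sum())

# Option 1: cast the string column to integers (leading zeros are dropped)
n_as_int = int(filings['EIN'].astype('int64').isin(sample['ein']).sum())

# Option 2: zero-pad the integers to 9-character strings (leading zeros are kept)
sample['ein_str'] = sample['ein'].astype(str).str.zfill(9)
n_as_str = int(filings['EIN'].isin(sample['ein_str']).sum())

print(n_direct, n_as_int, n_as_str)
```

Option 2 keeps the EIN as text, which may be preferable if the column is later exported or merged with other string-keyed files.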

In [199]:
eins[:1]

Unnamed: 0,ein,year
0,980391928,2009


# Fill EIN for one observation, taking the value from the *Filer* column

In [200]:
print(len(df[df['EIN'].isnull()]))

1


In [204]:
df[df['EIN'].isnull()]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
1895015,,,,201812,,,2018,"{'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationC...",2020-04-17 16:48:07Z,,,,,,1,,,1075,0,,,JOHN MORA,2019-11-14,,CA,2018-12-31,2018,2019-11-15T10:29:26-06:00,,1,,,,WWW.PLAYFLAGFOOTBALL.COM,2006,0,0,0,0,0,0,32567,8092,0,0,0,250,-9303,-7842,,23264,8092,0,0,300,0,250,0,0,3,0,0,0,23264,OPERATING YOUTH SPORTS PROGRAMS FOR THE PUBLIC BENEFIT,10378,0,1075,0,0,32567,10378,0,1075,0,0,0,0,1,0,0,1,0,0,0,1,0,0,,1,0,,0,0,0,1,1,1,0,3,0,0,,,,CA,1,0,0,1,0,0,0,0,0,0,,,,,,,,,,,,,,1075,,,,1075,1075,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10378,0,1729,8649,,6630,4245,215130,196111,,1,,,,,,,,,23264,,,,-9303,,0,,1,,,,0,0,1,CA


In [206]:
# Value copied by hand from the 'EIN' key of this row's Filer record
df.loc[1895015, 'EIN'] = '204814407'

In [207]:
df.loc[[1895015]]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US
1895015,,,,201812,204814407,,2018,"{'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationC...",2020-04-17 16:48:07Z,,,,,,1,,,1075,0,,,JOHN MORA,2019-11-14,,CA,2018-12-31,2018,2019-11-15T10:29:26-06:00,,1,,,,WWW.PLAYFLAGFOOTBALL.COM,2006,0,0,0,0,0,0,32567,8092,0,0,0,250,-9303,-7842,,23264,8092,0,0,300,0,250,0,0,3,0,0,0,23264,OPERATING YOUTH SPORTS PROGRAMS FOR THE PUBLIC BENEFIT,10378,0,1075,0,0,32567,10378,0,1075,0,0,0,0,1,0,0,1,0,0,0,1,0,0,,1,0,,0,0,0,1,1,1,0,3,0,0,,,,CA,1,0,0,1,0,0,0,0,0,0,,,,,,,,,,,,,,1075,,,,1075,1075,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10378,0,1729,8649,,6630,4245,215130,196111,,1,,,,,,,,,23264,,,,-9303,,0,,1,,,,0,0,1,CA
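
The fix above hard-codes one row index and one EIN. If more rows turn up with a missing EIN in a later run, the same fill could be done programmatically. This is only a sketch: it assumes *Filer* holds either a dict or the string representation of one, and the helper `ein_from_filer` and the frame `toy` are hypothetical names used for illustration.

```python
import ast
import pandas as pd

def ein_from_filer(filer):
    """Return the 'EIN' entry of a Filer record, or None if it cannot be read."""
    if isinstance(filer, str):
        try:
            filer = ast.literal_eval(filer)  # parse "{'EIN': ...}" into a dict
        except (ValueError, SyntaxError):
            return None
    if isinstance(filer, dict):
        return filer.get('EIN')
    return None

# Toy frame: one complete row and one row with a missing EIN
toy = pd.DataFrame(
    {'EIN': ['232705170', None],
     'Filer': [{'EIN': '232705170'},
               "{'EIN': '204814407', 'BusinessNameControlTxt': 'PLAY'}"]},
    index=[0, 1895015])

# Fill only the rows where EIN is missing
missing = toy['EIN'].isnull()
toy.loc[missing, 'EIN'] = toy.loc[missing, 'Filer'].apply(ein_from_filer)
print(toy['EIN'])
```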


In [201]:
[c for c in df.columns.tolist() if 'ein' in c.lower()]

['EIN']

In [208]:
# Integer copy of EIN for matching against the int64 `ein` column in `eins`
df['ein_int'] = df['EIN'].astype('int64')
print(len(df[df['ein_int'].isnull()]))

0
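
Because `astype` raises an error when it meets a missing or non-numeric value, the null count after the integer cast will always be 0. A more informative check can be run before the cast. The sketch below assumes the EINs are stored as strings; the toy series `ein` is hypothetical.

```python
import pandas as pd

# Toy column: two valid EINs, one junk value, one missing value
ein = pd.Series(['232705170', '043177990', 'N/A', None])

# errors='coerce' turns anything that is not a number into NaN instead of raising
as_num = pd.to_numeric(ein, errors='coerce')

# Rows that would have broken a plain astype('int64')
bad = ein[as_num.isnull()]
print(len(bad))

# Safe integer conversion of the remaining rows
clean = as_num.dropna().astype('int64')
print(clean.tolist())
```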


In [211]:
ein_list = list(set(eins['ein'].tolist()))
print(len(ein_list))
print(ein_list[:5])

427
[541382657, 43177990, 840865803, 941156365, 420841485]


In [212]:
print(len(df[df['ein_int'].isin(ein_list)]))

3736


In [214]:
dfe = df[df['ein_int'].isin(ein_list)]
print(len(dfe))
print(len(set(dfe['ein_int'].tolist())))
dfe[:1]

3736
412


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US,ein_int
26,UNIVERSIDAD INTERAMERICANA DE PUERTO RICO,https://s3.amazonaws.com/irs-form-990/201123129349300522_public.xml,93493312005221,201106,660177776,,2011,"{'EIN': '660177776', 'Name': {'BusinessNameLine1': 'Universidad Interamericana de Puerto Rico'}, 'NameControl': 'UNIV', 'Phone': '7877661912', 'USAddress': {'AddressLine1': 'PO Box 363255', 'City': 'San Juan', 'State': 'PR', 'ZIPCode': '009363255'}}",2016-02-24 21:20:13Z,,,,,,1,,,274436291,0,,,Manuel J Fernos,2011-11-04,,PR,2011-06-30,2010,2011-11-08T09:22:05-06:00,,1,,,,www.inter.edu,1912,0,168629,545736,0,0,3312892,312623607,79184805,17417325,0,0,241866553,31845468,31117570,,495095132,232024936,198931,6914,0,133361481,263142506,460587,460587,23,0,0,3389330,361733651,"The Universidad Interamericana de Puerto Rico, Inc. is a higher education institution pursuing quality, academic excellence with emphasis in the formation of people with high standards, ethical and democratically values.",82853450,21765138,249113194,159737373,152840131,452329909,242590823,139706302,274436291,1,0,0,1,1,1,0,1,1,0,0,1,1,0,,1,0,,1,0,1,1,1,1,0,23,0,1,,,1,"[PR, FL]",1,0,0,,132,27,1,104557,1824953,0,,168629,,,0,168629,1114925,,,,,,168629,249113194,,168629,20650213,249113194,274436291,,,,,,4967605,4492744,9460349,358500,,462911,,,11472846,,4710618,5916282,10626900,1400,20390082,103371759,123763241,107,4033058,8302768,12335933,,712449,2838501,3550950,242590823,198931,67861614,174530278,,31821397,33239611,430375415,182433094,,1,,94647755,,,,13288875,,,495095132,,,,31845468,,0,1,,,1,1,1,1,1,PR,660177776
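
The counts above imply that 427 - 412 = 15 of the sampled EINs have no filing in the e-file data. A small sketch of how those EINs could be listed with a set difference follows. The toy values are placeholders, not the real unmatched EINs, and `sample_eins` and `found_eins` are hypothetical stand-ins for `ein_list` and `dfe['ein_int']`.

```python
import pandas as pd

# Stand-in for ein_list (unique EINs in the sample file)
sample_eins = [541382657, 43177990, 840865803]

# Stand-in for dfe['ein_int'] (EINs of the matched filings, with repeats)
found_eins = pd.Series([43177990, 43177990, 840865803])

# EINs that appear in the sample but never in the filings
not_found = sorted(set(sample_eins) - set(found_eins))
print(len(not_found), not_found)
```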


In [216]:
pd.DataFrame(dfe['fiscal_year'].value_counts())

Unnamed: 0,fiscal_year
2017,407
2012,404
2016,403
2014,403
2013,402
2015,400
2018,395
2011,391
2010,365
2019,166
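
The sample file has 3,009 EIN-year rows, so a stricter question is how many EIN-year pairs, not just EINs, are covered. A hedged sketch using a left merge with an indicator column is shown below. It assumes that `year` in the sample file is comparable to `fiscal_year`, which I have not verified; `sample` and `filings` are toy stand-ins for `eins` and `df`.

```python
import pandas as pd

# Toy stand-ins for the sample file and the e-file data
sample = pd.DataFrame({'ein': [1, 1, 2, 3],
                       'year': [2010, 2011, 2010, 2010]})
filings = pd.DataFrame({'ein_int': [1, 1, 2, 2],
                        'fiscal_year': [2010, 2010, 2010, 2012]})

# One row per EIN-year so that amended or duplicate filings do not inflate the merge
keys = filings[['ein_int', 'fiscal_year']].drop_duplicates()

# indicator=True adds a `_merge` column: 'both' = found, 'left_only' = not found
merged = sample.merge(keys, how='left',
                      left_on=['ein', 'year'],
                      right_on=['ein_int', 'fiscal_year'],
                      indicator=True)
n_found = int((merged['_merge'] == 'both').sum())
print(n_found, 'of', len(merged), 'EIN-year pairs found')
```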


# Change Data Types

In [220]:
new_variables_df['python_data_type'].value_counts()

Int64       177
string       14
DateTime      2
Name: python_data_type, dtype: int64
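
For the 177 variables labeled 'Int64', the resulting dtype depends on whether a column has missing values. A minimal sketch on toy data: `pd.to_numeric` returns NumPy `int64` for a complete column and `float64` for a column with gaps, and a column with gaps can be cast to the nullable `Int64` dtype afterwards if integers are required. The column names here are made up.

```python
import pandas as pd

# Toy columns stored as strings, one of them with a missing value
toy = pd.DataFrame({'complete': ['1', '0', '10'],
                    'with_gaps': ['1075', None, '250']})

# Same pattern as df[Int64_vars].apply(pd.to_numeric)
converted = toy.apply(pd.to_numeric)
print(converted.dtypes)

# Optional: nullable integer dtype keeps the gap as <NA> instead of NaN
nullable = converted['with_gaps'].astype('Int64')
print(nullable.dtype)
```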

### String variables

In [226]:
string_vars = new_variables_df[new_variables_df['python_data_type']=='string']['variable_name_new'].tolist()
print(len(string_vars))
print(string_vars)

14
['F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_FILER_STATE_US', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_WEBSITE', 'F9_01_PZ_ORGANIZATIONAL_MISSION', 'F9_06_PC_STATES_WHERE_RET_FILED', 'F9_12_PC_ACCTG_METHOD_OTHER', 'TaxPeriod']


In [227]:
df[string_vars].dtypes

F9_00_HD_CTRY_OF_DOMICILE          object
F9_00_HD_FILER_STATE_US            object
F9_00_HD_GROSS_EXEMPT_NUM          object
F9_00_HD_PRIN_OFF_NAME             object
F9_00_HD_SIGNING_OFFICER_SIGNTR    object
F9_00_HD_SPECIAL_CONDITION_DESC    object
F9_00_HD_STATE_OF_DOMICILE         object
F9_00_HD_TAX_PER_END               object
F9_00_HD_TYPE_ORG_OTHER_DESC       object
F9_00_HD_WEBSITE                   object
F9_01_PZ_ORGANIZATIONAL_MISSION    object
F9_06_PC_STATES_WHERE_RET_FILED    object
F9_12_PC_ACCTG_METHOD_OTHER        object
TaxPeriod                          object
dtype: object

<br>These 14 variables are already stored as strings (pandas *object* dtype), so they can be left alone.
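
An *object* dtype only says the column holds Python objects, not that every value is a string. A rough check, assuming we only care about non-missing values, is to count entries that are not `str`. The toy frame and the helper name `non_string_count` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[string_vars]; the integer 201012 is planted on purpose
toy = pd.DataFrame({
    'F9_00_HD_WEBSITE': ['www.inter.edu', np.nan],
    'TaxPeriod': ['201106', 201012],
}, dtype=object)

def non_string_count(col):
    """Count non-missing values in a column that are not Python strings."""
    return sum(not isinstance(v, str) for v in col.dropna())

counts = toy.apply(non_string_count)
print(counts)
```

A count of zero for every column would confirm that nothing needs converting.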

### *DateTime* columns

In [228]:
DateTime_vars = new_variables_df[new_variables_df['python_data_type']=='DateTime']['variable_name_new'].tolist()
print(len(DateTime_vars))
print(DateTime_vars)

2
['F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_TIME_STAMP']


In [229]:
df[DateTime_vars].dtypes

F9_00_HD_BUILD_TIME_STAMP    object
F9_00_HD_TIME_STAMP          object
dtype: object

In [230]:
df[DateTime_vars].sample(5)

Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_TIME_STAMP
1336024,2018-03-14 21:41:22Z,2018-01-29T08:19:28-08:00
782080,2016-02-25 16:41:14Z,2015-11-16T09:20:40-06:00
1679873,2019-02-21 02:37:17Z,2019-09-12T13:10:02-05:00
704758,2016-02-25 16:41:14Z,2015-07-31T14:37:38-04:00
566970,2015-11-30 17:44:51Z,2014-09-24T07:44:18-00:00


##### F9_00_HD_BUILD_TIME_STAMP

In [233]:
pd.to_datetime(df.sample(10)['F9_00_HD_BUILD_TIME_STAMP'])

433319    2015-11-30 17:44:51+00:00
1122349   2017-02-10 21:41:12+00:00
132441    2016-02-24 21:20:13+00:00
1016671   2017-02-10 21:41:12+00:00
682475    2016-02-25 16:41:14+00:00
920435    2016-09-27 15:27:22+00:00
432623    2016-03-07 17:11:31+00:00
1346171   2018-06-14 16:35:46+00:00
1486423   2018-06-14 16:35:46+00:00
1500158   2018-06-14 16:35:46+00:00
Name: F9_00_HD_BUILD_TIME_STAMP, dtype: datetime64[ns, UTC]

In [247]:
print(len(df[df['F9_00_HD_BUILD_TIME_STAMP'].isnull()]))

0


In [234]:
%%time
df['F9_00_HD_BUILD_TIME_STAMP'] = pd.to_datetime(df['F9_00_HD_BUILD_TIME_STAMP'])

Wall time: 5.32 s


In [240]:
df.sample(5)['F9_00_HD_BUILD_TIME_STAMP']

401712    2016-03-07 17:11:31+00:00
1675060   2019-02-21 02:37:17+00:00
1391429   2018-06-14 16:35:46+00:00
835886    2016-03-07 17:11:31+00:00
575277    2015-11-30 17:44:51+00:00
Name: F9_00_HD_BUILD_TIME_STAMP, dtype: datetime64[ns, UTC]

In [237]:
df['F9_00_HD_BUILD_TIME_STAMP'].min()

Timestamp('2015-11-30 17:44:51+0000', tz='UTC')

In [238]:
df['F9_00_HD_BUILD_TIME_STAMP'].max()

Timestamp('2020-09-23 17:36:50+0000', tz='UTC')

##### F9_00_HD_TIME_STAMP

In [263]:
df.sample(5)['F9_00_HD_TIME_STAMP']

1366379    2018-05-10 15:42:38-05:00
1829352    2019-11-15 12:15:19-06:00
481438     2014-03-25 11:53:06-07:00
1641341    2019-05-15 07:34:24-05:00
1475602    2018-09-18 10:16:39-05:00
Name: F9_00_HD_TIME_STAMP, dtype: object

In [None]:
print(len(df[df['F9_00_HD_TIME_STAMP'].isnull()]))

In [267]:
%%time
df['F9_00_HD_TIME_STAMP'] = pd.to_datetime(df['F9_00_HD_TIME_STAMP'])

Wall time: 0 ns


<br>Function to deal with missing values -- See *IRS 990 e-File Preparer Data -- (9n) -- Schedule O - Number of Words and Time Delta Variables.ipynb*

In [252]:
def timefunc(x):
    # Pass missing values through unchanged; parse everything else
    if pd.isnull(x):
        return np.nan
    else:
        return pd.to_datetime(x)
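
As a side note, *pd.to_datetime* already maps missing values to `NaT` when given a whole column, so the element-wise wrapper may not be strictly necessary. The sketch below compares the two on a toy series; the values are illustrative and share a single UTC offset to keep the example simple.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['F9_00_HD_TIME_STAMP'] with one missing value
stamps = pd.Series(
    ['2011-11-09T06:41:09-06:00', None, '2011-11-09T10:05:52-06:00'],
    dtype=object,
)

def timefunc(x):
    # Same logic as the notebook's helper
    if pd.isnull(x):
        return np.nan
    return pd.to_datetime(x)

elementwise = stamps.apply(timefunc)   # one Python call per row
vectorized = pd.to_datetime(stamps)    # one call for the whole column

print(elementwise.isna().tolist())
print(vectorized.isna().tolist())
```

Both versions flag the same row as missing; on 1.9 million rows the vectorized call is likely to be much faster.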

In [279]:
%%time
df['F9_00_HD_TIME_STAMP_dt'] = df['F9_00_HD_TIME_STAMP'].apply(timefunc)
##df['F9_09_PC_TOTAL_FUNC_EXPENSES'] = df['F9_09_PC_TOTAL_FUNC_EXPENSES'].astype('float')
df[DateTime_vars][:6]

Unnamed: 0,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_TIME_STAMP
0,2016-02-24 21:20:13+00:00,2011-11-09 06:41:09-06:00
1,2016-02-24 21:20:13+00:00,2011-11-09 07:32:06-08:00
2,2016-02-24 21:20:13+00:00,2011-11-09 07:33:03-08:00
3,2016-02-24 21:20:13+00:00,2011-11-09 07:54:44-08:00
4,2016-02-24 21:20:13+00:00,2011-11-09 10:05:52-06:00
5,2016-02-24 21:20:13+00:00,2011-11-09 08:42:28-08:00


In [284]:
df.sample(5)['F9_00_HD_TIME_STAMP']

1223522    2017-09-19 07:35:56-07:00
230279     2012-11-15 09:19:39-08:00
84059      2011-05-04 11:37:25-04:00
759656     2015-08-17 16:52:05-05:00
1803658    2020-02-10 12:31:10-06:00
Name: F9_00_HD_TIME_STAMP, dtype: object

In [283]:
df.sample(5)['F9_00_HD_TIME_STAMP_dt']

1891644    2020-02-23 11:16:55-05:00
650231     2015-03-09 07:43:27+00:00
170381     2012-10-08 09:04:30+00:00
1696037    2019-05-13 17:50:48-05:00
448438     2014-01-21 11:19:50-08:00
Name: F9_00_HD_TIME_STAMP_dt, dtype: object

In [285]:
for index, row in df[:5].iterrows():
    print(type(row['F9_00_HD_BUILD_TIME_STAMP']), type(row['F9_00_HD_TIME_STAMP']))
    print(row['F9_00_HD_BUILD_TIME_STAMP'].year)
    print(row['F9_00_HD_TIME_STAMP'].year)
    print(row['F9_00_HD_TIME_STAMP_dt'].year)    

<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'datetime.datetime'>
2016
2011
2011
<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'datetime.datetime'>
2016
2011
2011
<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'datetime.datetime'>
2016
2011
2011
<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'datetime.datetime'>
2016
2011
2011
<class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'datetime.datetime'>
2016
2011
2011
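
The values in *F9_00_HD_TIME_STAMP* are plain `datetime.datetime` objects and the column stays *object*, presumably because the strings carry different UTC offsets (-05:00, -06:00, -08:00, ...). A possible alternative, not used in this notebook, is to pass `utc=True` so that everything is normalized to UTC and the column gets a proper timezone-aware datetime dtype. The third value below is invented to show one side effect.

```python
import pandas as pd

# Toy values with mixed offsets; the last one is made up for illustration
stamps = pd.Series([
    '2018-01-29T08:19:28-08:00',
    '2015-11-16T09:20:40-06:00',
    '2011-12-31T20:00:00-08:00',
], dtype=object)

# utc=True converts every value to UTC, giving one consistent dtype
stamps_utc = pd.to_datetime(stamps, utc=True)

print(stamps_utc.dtype)
print(stamps_utc.dt.year.tolist())
```

Note the trade-off: the last value is 8 p.m. on Dec. 31, 2011 local time but falls in 2012 once expressed in UTC, so `.dt.year` can differ from the year read off the original string.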


In [282]:
df['F9_00_HD_TIME_STAMP'].min()

datetime.datetime(2002, 1, 1, 3, 19, 19, tzinfo=tzoffset(None, -18000))

In [255]:
df['F9_00_HD_TIME_STAMP'].max()

datetime.datetime(2020, 8, 27, 19, 58, 59, tzinfo=tzoffset(None, -14400))

In [290]:
%%time
df['F9_00_HD_TIME_STAMP_yr'] = df['F9_00_HD_TIME_STAMP'].apply(lambda x: str(x)[:4])
print(df['F9_00_HD_TIME_STAMP_yr'].value_counts(), '\n')
df[['F9_00_HD_TIME_STAMP_yr']][:2]

2018    252068
2019    249630
2017    239015
2016    228991
2015    211775
2014    193702
2013    175554
2012    142315
2011    117559
2020     82890
2010      1333
2009       183
2002         1
Name: F9_00_HD_TIME_STAMP_yr, dtype: int64 

Wall time: 15.5 s


Unnamed: 0,F9_00_HD_TIME_STAMP_yr
0,2011
1,2011


In [294]:
#df = df.drop(columns='F9_00_HD_TIME_STAMP_dt')

In [299]:
df[df['F9_00_HD_TIME_STAMP_yr']=='2002'][['EIN', 'fiscal_year', '501c3', 'F9_00_HD_BUILD_TIME_STAMP',
                                          'F9_00_HD_TIME_STAMP', 'F9_00_HD_TIME_STAMP_yr']]

Unnamed: 0,EIN,fiscal_year,501c3,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_TIME_STAMP,F9_00_HD_TIME_STAMP_yr
234090,10838477,2011,1,2016-02-24 21:20:13+00:00,2002-01-01 03:19:19-05:00,2002


In [None]:
df[DateTime_vars].dtypes

# There is one row with a time stamp earlier than 2009 (EIN 10838477, fiscal year 2011, time stamp 2002-01-01)
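
One way such rows could be flagged automatically is sketched below. It rests on an assumption of mine, not something verified here: a return should not carry a time stamp from a year before its *fiscal_year*, since the fiscal year is based on the end of the tax period. The toy frame is illustrative.

```python
import pandas as pd

# Toy stand-in for df; the middle row mimics the 2002 time stamp found above
toy = pd.DataFrame({
    'fiscal_year': [2011, 2011, 2014],
    'F9_00_HD_TIME_STAMP': [
        '2011-11-09T06:41:09-06:00',
        '2002-01-01T03:19:19-05:00',
        '2015-03-09T07:43:27-05:00',
    ],
})

# Normalize to UTC so the .dt accessor is available, then compare years
stamp_year = pd.to_datetime(toy['F9_00_HD_TIME_STAMP'], utc=True).dt.year
suspect = toy[stamp_year < toy['fiscal_year']]

print(suspect)
```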

#### Save DF

In [300]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types).pkl')

Wall time: 2min 27s


### *Int64* columns

In [301]:
Int64_vars = new_variables_df[new_variables_df['python_data_type']=='Int64']['variable_name_new'].tolist()
print(len(Int64_vars))
print(Int64_vars)

177
['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_CURR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INDEP_VOTING_MEMB', 'F9_01_PC_INVEST_INCOME_PRIOR', 'F9_01_PC_NET_ASSETS_BOY', 'F9_01_PC_OTHER_EXPENSE_PRIOR', 'F9_01_PC_OTHER_REV_PRIOR', 'F9_01_PC_PROF_FUNDRISING_EXP_CURR', 'F9_01_PC_PROF_FUNDRISING_EXP_PRIOR', 'F9_01_PC_PROG_SERVICE_REV_PRIOR', 'F9_01_PC_REV_LESS_EXP_CURR', 'F9_01_PC_REV_LESS_EXP_PRIOR', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_01_PC_TOT_ASSETS_EOY', 'F9_01_PC_TOT_EXP_PRIOR', 'F9_01_PC_TOT_FNDR_EXP_CURR', 

In [303]:
df[Int64_vars].dtypes[:50]

F9_00_HD_ADDR_CHANGE                  object
F9_00_HD_AMENDED_RETURN               object
F9_00_HD_EXEMPT_STATUS_4847A1         object
F9_00_HD_EXEMPT_STATUS_501C           object
F9_00_HD_EXEMPT_STATUS_501C3          object
F9_00_HD_FINAL_RETURN                 object
F9_00_HD_GROSS_RCPT                   object
F9_00_HD_GROUP_RETURN                 object
F9_00_HD_INCLUDES_SUBORD_ORGS         object
F9_00_HD_INITIAL_RETURN               object
F9_00_HD_TAX_YEAR                     object
F9_00_HD_TYPE_ORG_ASSOCIATION         object
F9_00_HD_TYPE_ORG_CORP                object
F9_00_HD_TYPE_ORG_OTHER               object
F9_00_HD_TYPE_ORG_TRUST               object
F9_00_HD_YEAR_FORMED                  object
F9_01_PC_BEN_PAID_MEMB_PRIOR          object
F9_01_PC_CONTR_GRANTS_CURR            object
F9_01_PC_CONTR_GRANTS_PRIOR           object
F9_01_PC_GRANTS_PRIOR                 object
F9_01_PC_INDEP_VOTING_MEMB            object
F9_01_PC_INVEST_INCOME_PRIOR          object
F9_01_PC_N

In [304]:
df[Int64_vars].dtypes[50:100]

F9_01_PZ_TOT_ASSETS_BOY              object
F9_01_PZ_TOT_EXP_CURR                object
F9_01_PZ_TOT_LIAB_BOY                object
F9_01_PZ_TOT_REV_CURR                object
F9_04_PC_FR_EVENT_INC_GT_15K         object
F9_04_PC_GAMING_INC_GT_15K           object
F9_04_PC_PROF_FR_EXP_GT_15K          object
F9_06_PC_990_PROVIDED_GOV_BODY       object
F9_06_PC_ANNUAL_DISC_COVRD_PERS      object
F9_06_PC_CEO_COMPENSTN_PROCESS       object
F9_06_PC_CHANGES_ORGANIZING_DOCS     object
F9_06_PC_CONFLICT_OF_INTEREST        object
F9_06_PC_DECISIONS_SUBJ_APPROVAL     object
F9_06_PC_DELEGATION_MGT_DUTIES       object
F9_06_PC_DELEGATION_OF_MGT           object
F9_06_PC_DOCUMENT_RET_POLICY         object
F9_06_PC_ELECTION_BOARD_MEMBERS      object
F9_06_PC_FAMILY_OR_BUSINESS_REL      object
F9_06_PC_FORM_AVAIL_OWN_WEBSITE      object
F9_06_PC_FORM_UPON_REQUEST           object
F9_06_PC_JOINT_VENTURE_INVESTMNT     object
F9_06_PC_JOINT_VENTURE_POLICY        object
F9_06_PC_LOCAL_CHAPTERS         

In [305]:
df[Int64_vars].dtypes[100:150]

F9_08_PC_FUNDRAISING_EVENTS         object
F9_08_PC_FUNDRAISING_GROSS_INC      object
F9_08_PC_GAMING_DIRECT_EXPENSES     object
F9_08_PC_GAMING_GROSS_INCOME        object
F9_08_PC_GOVERNMENT_GRANTS          object
F9_08_PC_GROSS_SALES_INVENTORY      object
F9_08_PC_MEMBERSHIP_DUES            object
F9_08_PC_NONCASH_CONTRIBUTIONS      object
F9_08_PC_PROGRAM_SVCE_REV_TOTAL     object
F9_08_PC_RELATED_ORGANIZATIONS      object
F9_08_PC_TOTAL_CONTRIBUTIONS        object
F9_08_PC_TOTAL_OTHER_REVENUE        object
F9_08_PC_TOTAL_PROG_SVCE_REVENUE    object
F9_08_PC_TOTAL_REVENUE              object
F9_09_PC_COMP_DISQUAL_FUNDRAISE     object
F9_09_PC_COMP_DISQUAL_MGMT          object
F9_09_PC_COMP_DISQUAL_PROG_SVCE     object
F9_09_PC_COMP_DISQUAL_TOTAL         object
F9_09_PC_COMP_OFFICERS_FUNDRAISE    object
F9_09_PC_COMP_OFFICERS_MGMT         object
F9_09_PC_COMP_OFFICERS_PROG_SVCE    object
F9_09_PC_COMP_OFFICERS_TOTAL        object
F9_09_PC_FEES_FOR_SVCE_ACCT_TOT     object
F9_09_PC_FE

In [306]:
df[Int64_vars].dtypes[150:]

F9_10_PC_CASH_NON_INTEREST_BOY      object
F9_10_PC_CASH_NON_INTEREST_EOY      object
F9_10_PC_LAND_BLDG_EQPMT            object
F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN    object
F9_10_PC_LOANS_FROM_OFFICERS_EOY    object
F9_10_PC_ORG_FOLLOWS_SFAS117        object
F9_10_PC_ORG_NOT_FOLLOW_SFAS117     object
F9_10_PC_OTHER_LIABILITIES_EOY      object
F9_10_PC_RET_EARNINGS_ENDWMT_EOY    object
F9_10_PC_SAVINGS_TEMP_INVEST_BOY    object
F9_10_PC_SAVINGS_TEMP_INVEST_EOY    object
F9_10_PC_SECURED_MORTGAGES_EOY      object
F9_10_PC_UNSECURED_NOTES_BOY        object
F9_10_PC_UNSECURED_NOTES_EOY        object
F9_10_PZ_TOTAL_ASSETS_EOY           object
F9_11_PC_RECNCLTN_DONATED_SVCES     object
F9_11_PC_RECNCLTN_INVSTMNT_EXP      object
F9_11_PC_RECNCLTN_PRIOR_PER_ADJ     object
F9_11_PC_RECNCLTN_REV_LESS_EXP      object
F9_11_PC_RECNCLTN_UNRLZD_GAIN       object
F9_12_PC_ACCNT_COMPILE_OR_REVIEW    object
F9_12_PC_ACCTG_METHOD_ACCRUAL       object
F9_12_PC_ACCTG_METHOD_CASH          object
F9_12_PC_AU

In [308]:
#for c in Int64_vars:
#    print("df['%s'] = df['%s'].astype('Int64')" % (c, c))

df['F9_00_HD_ADDR_CHANGE'] = df['F9_00_HD_ADDR_CHANGE'].astype('Int64')
df['F9_00_HD_AMENDED_RETURN'] = df['F9_00_HD_AMENDED_RETURN'].astype('Int64')
df['F9_00_HD_EXEMPT_STATUS_4847A1'] = df['F9_00_HD_EXEMPT_STATUS_4847A1'].astype('Int64')
df['F9_00_HD_EXEMPT_STATUS_501C'] = df['F9_00_HD_EXEMPT_STATUS_501C'].astype('Int64')
df['F9_00_HD_EXEMPT_STATUS_501C3'] = df['F9_00_HD_EXEMPT_STATUS_501C3'].astype('Int64')
df['F9_00_HD_FINAL_RETURN'] = df['F9_00_HD_FINAL_RETURN'].astype('Int64')
df['F9_00_HD_GROSS_RCPT'] = df['F9_00_HD_GROSS_RCPT'].astype('Int64')
df['F9_00_HD_GROUP_RETURN'] = df['F9_00_HD_GROUP_RETURN'].astype('Int64')
df['F9_00_HD_INCLUDES_SUBORD_ORGS'] = df['F9_00_HD_INCLUDES_SUBORD_ORGS'].astype('Int64')
df['F9_00_HD_INITIAL_RETURN'] = df['F9_00_HD_INITIAL_RETURN'].astype('Int64')
df['F9_00_HD_TAX_YEAR'] = df['F9_00_HD_TAX_YEAR'].astype('Int64')
df['F9_00_HD_TYPE_ORG_ASSOCIATION'] = df['F9_00_HD_TYPE_ORG_ASSOCIATION'].astype('Int64')
df['F9_00_HD_TYPE_ORG_CORP'] = df['F9_00_HD_

##### This approach doesn't work:
- df['F9_00_HD_EXEMPT_STATUS_501C'] = df['F9_00_HD_EXEMPT_STATUS_501C'].astype('Int64')

Converting some variables directly to 'Int64' fails with the following error message:
- TypeError: object cannot be converted to an IntegerDtype

Instead, use the following one-liner. *pd.to_numeric* picks the dtype for each column: 'int64' when the column has no missing values and 'float64' when it does. The first three columns, which the attempt above had apparently already converted before the error, stay 'Int64'.

In [316]:
%%time
print(len(df), len(df.columns))
df[Int64_vars] = df[Int64_vars].apply(pd.to_numeric)
print(len(df), len(df.columns))
df[:1]

1895016 202
1895016 202
Wall time: 10min 28s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3,F9_00_HD_FILER_STATE_US,ein_int,F9_00_HD_TIME_STAMP_yr
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13+00:00,1,,,,,1.0,,,1473903,0,,,MICHAEL ANTON,2011-11-04,,PA,2010-12-31,2010,2011-11-09 06:41:09-06:00,,1.0,,,,,1992.0,0.0,1439340,1044925.0,638637.0,10,30447.0,1753405.0,243131.0,0.0,0,0.0,0.0,89152,193604.0,,2440859,881768.0,195892,0,0.0,450430,1075372.0,0,0.0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0.0,1925215.0,1384751,171810.0,1473903,0,0,0,1,1.0,0.0,0,1,0,0,0,0,0,0,,1.0,0,,0,0,0,1.0,1,1.0,10,10,0,0.0,,,,"[PA, NJ, DE]",0,0,0,1.0,0.0,0.0,0,0.0,0.0,0.0,1439340.0,,,,,,,,,,,,,,,1439340.0,1000.0,,1473903,,,,,,,,,21675.0,,215.0,,,,,,,,,,,,,,,,,,,,1384751,195892.0,145115.0,1043744.0,,,,256845.0,86228.0,,1.0,,240077.0,,332660.0,270700.0,,,,2440859,,,,89152.0,,0,1.0,,,1.0,,0.0,1,1,PA,232705170,2011
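
To make the dtype choice concrete, here is a small sketch on toy data (column names borrowed from above, values illustrative). It also shows an optional second step that this notebook does not take: once the values are numeric, `astype('Int64')` should succeed, provided every non-missing value is a whole number.

```python
import pandas as pd

# Toy stand-in for df[Int64_vars]: numeric strings, one column with a gap
toy = pd.DataFrame({
    'F9_00_HD_GROSS_RCPT': ['1473903', '40615', '1601333'],
    'F9_00_HD_YEAR_FORMED': ['1992', None, '2000'],
}, dtype=object)

# Step 1: the notebook's one-liner; dtype is chosen per column
as_numeric = toy.apply(pd.to_numeric)
print(as_numeric.dtypes)

# Step 2 (optional): nullable integers, so missing values no longer force floats
as_nullable = as_numeric.astype('Int64')
print(as_nullable.dtypes)
```

The column without missing values comes out as 'int64' and the one with a gap as 'float64'; after the second step both are 'Int64' with `<NA>` in place of `NaN`.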


In [319]:
df[Int64_vars].dtypes[:50]

F9_00_HD_ADDR_CHANGE                    Int64
F9_00_HD_AMENDED_RETURN                 Int64
F9_00_HD_EXEMPT_STATUS_4847A1           Int64
F9_00_HD_EXEMPT_STATUS_501C           float64
F9_00_HD_EXEMPT_STATUS_501C3          float64
F9_00_HD_FINAL_RETURN                 float64
F9_00_HD_GROSS_RCPT                     int64
F9_00_HD_GROUP_RETURN                   int64
F9_00_HD_INCLUDES_SUBORD_ORGS         float64
F9_00_HD_INITIAL_RETURN               float64
F9_00_HD_TAX_YEAR                       int64
F9_00_HD_TYPE_ORG_ASSOCIATION         float64
F9_00_HD_TYPE_ORG_CORP                float64
F9_00_HD_TYPE_ORG_OTHER               float64
F9_00_HD_TYPE_ORG_TRUST               float64
F9_00_HD_YEAR_FORMED                  float64
F9_01_PC_BEN_PAID_MEMB_PRIOR          float64
F9_01_PC_CONTR_GRANTS_CURR              int64
F9_01_PC_CONTR_GRANTS_PRIOR           float64
F9_01_PC_GRANTS_PRIOR                 float64
F9_01_PC_INDEP_VOTING_MEMB              int64
F9_01_PC_INVEST_INCOME_PRIOR      

In [320]:
df[Int64_vars].dtypes[50:100]

F9_01_PZ_TOT_ASSETS_BOY              float64
F9_01_PZ_TOT_EXP_CURR                  int64
F9_01_PZ_TOT_LIAB_BOY                float64
F9_01_PZ_TOT_REV_CURR                  int64
F9_04_PC_FR_EVENT_INC_GT_15K           int64
F9_04_PC_GAMING_INC_GT_15K             int64
F9_04_PC_PROF_FR_EXP_GT_15K            int64
F9_06_PC_990_PROVIDED_GOV_BODY         int64
F9_06_PC_ANNUAL_DISC_COVRD_PERS      float64
F9_06_PC_CEO_COMPENSTN_PROCESS       float64
F9_06_PC_CHANGES_ORGANIZING_DOCS       int64
F9_06_PC_CONFLICT_OF_INTEREST          int64
F9_06_PC_DECISIONS_SUBJ_APPROVAL       int64
F9_06_PC_DELEGATION_MGT_DUTIES         int64
F9_06_PC_DELEGATION_OF_MGT             int64
F9_06_PC_DOCUMENT_RET_POLICY           int64
F9_06_PC_ELECTION_BOARD_MEMBERS        int64
F9_06_PC_FAMILY_OR_BUSINESS_REL        int64
F9_06_PC_FORM_AVAIL_OWN_WEBSITE      float64
F9_06_PC_FORM_UPON_REQUEST           float64
F9_06_PC_JOINT_VENTURE_INVESTMNT       int64
F9_06_PC_JOINT_VENTURE_POLICY        float64
F9_06_PC_L

In [321]:
df[Int64_vars].dtypes[100:150]

F9_08_PC_FUNDRAISING_EVENTS         float64
F9_08_PC_FUNDRAISING_GROSS_INC      float64
F9_08_PC_GAMING_DIRECT_EXPENSES     float64
F9_08_PC_GAMING_GROSS_INCOME        float64
F9_08_PC_GOVERNMENT_GRANTS          float64
F9_08_PC_GROSS_SALES_INVENTORY      float64
F9_08_PC_MEMBERSHIP_DUES            float64
F9_08_PC_NONCASH_CONTRIBUTIONS      float64
F9_08_PC_PROGRAM_SVCE_REV_TOTAL     float64
F9_08_PC_RELATED_ORGANIZATIONS      float64
F9_08_PC_TOTAL_CONTRIBUTIONS        float64
F9_08_PC_TOTAL_OTHER_REVENUE        float64
F9_08_PC_TOTAL_PROG_SVCE_REVENUE    float64
F9_08_PC_TOTAL_REVENUE                int64
F9_09_PC_COMP_DISQUAL_FUNDRAISE     float64
F9_09_PC_COMP_DISQUAL_MGMT          float64
F9_09_PC_COMP_DISQUAL_PROG_SVCE     float64
F9_09_PC_COMP_DISQUAL_TOTAL         float64
F9_09_PC_COMP_OFFICERS_FUNDRAISE    float64
F9_09_PC_COMP_OFFICERS_MGMT         float64
F9_09_PC_COMP_OFFICERS_PROG_SVCE    float64
F9_09_PC_COMP_OFFICERS_TOTAL        float64
F9_09_PC_FEES_FOR_SVCE_ACCT_TOT 

In [322]:
df[Int64_vars].dtypes[150:]

F9_10_PC_CASH_NON_INTEREST_BOY      float64
F9_10_PC_CASH_NON_INTEREST_EOY      float64
F9_10_PC_LAND_BLDG_EQPMT            float64
F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN    float64
F9_10_PC_LOANS_FROM_OFFICERS_EOY    float64
F9_10_PC_ORG_FOLLOWS_SFAS117        float64
F9_10_PC_ORG_NOT_FOLLOW_SFAS117     float64
F9_10_PC_OTHER_LIABILITIES_EOY      float64
F9_10_PC_RET_EARNINGS_ENDWMT_EOY    float64
F9_10_PC_SAVINGS_TEMP_INVEST_BOY    float64
F9_10_PC_SAVINGS_TEMP_INVEST_EOY    float64
F9_10_PC_SECURED_MORTGAGES_EOY      float64
F9_10_PC_UNSECURED_NOTES_BOY        float64
F9_10_PC_UNSECURED_NOTES_EOY        float64
F9_10_PZ_TOTAL_ASSETS_EOY             int64
F9_11_PC_RECNCLTN_DONATED_SVCES     float64
F9_11_PC_RECNCLTN_INVSTMNT_EXP      float64
F9_11_PC_RECNCLTN_PRIOR_PER_ADJ     float64
F9_11_PC_RECNCLTN_REV_LESS_EXP      float64
F9_11_PC_RECNCLTN_UNRLZD_GAIN       float64
F9_12_PC_ACCNT_COMPILE_OR_REVIEW      int64
F9_12_PC_ACCTG_METHOD_ACCRUAL       float64
F9_12_PC_ACCTG_METHOD_CASH      

In [323]:
df[Int64_vars].sample(10)

Unnamed: 0,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_TAX_YEAR,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATER
IAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09
_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED
713554,,,,,1.0,,40615,0,0.0,,2014,1.0,,,,,,26127,18442.0,4000.0,0,14423.0,636739.0,20709.0,,0,,,-6792,-1135.0,,630349,34000.0,1080,1,,402,32865.0,0,,3,0,14150,14488,629947,24001,0,0,9256,9291.0,637151.0,47407,412.0,40615,0,0,0,1,1.0,1.0,0,1,0,0,0,1,0,0,,1.0,0,,0,0,0,0.0,0,1.0,0,3,0,0.0,,,,1,0,0,1.0,0.0,0.0,0,,,,26127.0,,,,,,,,,,,,,0.0,,26127.0,0.0,0.0,40615,,,,0.0,,,,0.0,861.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,,4200.0,4200.0,8400.0,,428.0,428.0,856.0,,,,0.0,47407,1080.0,9370.0,36957.0,,59762.0,60726.0,581401.0,162778.0,,1.0,,402.0,,,0.0,,,,630349,,,,-6792.0,,0,1.0,,,,0.0,0
1264947,,,,,1.0,,1601333,0,,,2016,,1.0,,,2000.0,,1516081,1458911.0,,4,1.0,1078652.0,235522.0,66490.0,0,,,86158,164228.0,,1298848,1361174.0,3318,28,82.0,110042,1525402.0,0,,4,0,0,1,1188806,256783,85251,0,1258392,1125652.0,1243533.0,1515175,164881.0,1601333,0,0,0,1,1.0,1.0,0,1,0,0,0,1,0,0,,1.0,0,,0,0,0,1.0,1,1.0,4,4,1,0.0,,,,0,0,0,,,,0,10428.0,88789.0,,130.0,1576.0,,,,1576.0,,,,1514375.0,,,,,,1516081.0,85251.0,,1601333,,,,,0.0,89273.0,0.0,89273.0,30713.0,,,,,,34437.0,0.0,9867.0,100084.0,109951.0,0.0,96126.0,796464.0,892590.0,0.0,17324.0,77375.0,94699.0,0.0,15457.0,56422.0,71879.0,1515175,3318.0,408331.0,1103526.0,,532218.0,620790.0,1054836.0,392269.0,,1.0,,,,6138.0,6139.0,,163521.0,106647.0,1298848,,,,86158.0,,1,,,1.0,1.0,1.0,1
1201279,,,,,1.0,,127936,0,,,2016,,1.0,,,1960.0,,0,,,12,32.0,48785.0,71169.0,63326.0,0,,7697.0,146,-114.0,,48931,71169.0,0,0,80.0,0,71055.0,31,,12,0,0,31,48931,61648,58160,3603,0,,48785.0,61648,,61794,1,0,0,1,,0.0,0,0,1,0,0,0,1,0,,1.0,0,,0,0,1,1.0,1,,12,12,0,0.0,,,,0,0,0,1.0,,,0,,,,,,,,66142.0,,124302.0,,,,,,,3603.0,,,,3603.0,61794,,,,,,,,,1500.0,,,,,,,,,,,,,,,,,,,,,,,61648,0.0,6414.0,55234.0,,13309.0,13427.0,,,,1.0,,,,35476.0,35504.0,,,,48931,,,,146.0,,0,,1.0,,,0.0,0
1847531,,,,,1.0,,41056602,0,,,2018,,1.0,,,1885.0,0.0,36660592,27695313.0,0.0,13,401020.0,13920365.0,17786353.0,71556.0,0,0.0,2644340.0,28945754,9221650.0,,46012567,21590579.0,1944848,165,2300.0,3115953,30812229.0,0,0.0,14,0,0,229086,42896614,6830922,71650,2887833,4072485,3804226.0,17783865.0,10903407,3863500.0,39849161,0,0,1,1,1.0,1.0,0,1,0,0,0,1,0,0,1.0,1.0,0,,0,0,0,1.0,1,1.0,13,14,0,0.0,,1.0,,1,0,0,,1.0,4.0,1,52272.0,591232.0,0.0,36660592.0,,,,,,,,,,,,3265433.0,2887833.0,,36660592.0,25062.0,2887833.0,39849161,,,,,77413.0,51741.0,369473.0,498627.0,40000.0,,,36904.0,,,156330.0,77610.0,32149.0,429661.0,539420.0,426777.0,228444.0,2093687.0,2748908.0,35863.0,16865.0,171677.0,224405.0,13825.0,7786.0,39514.0,61125.0,10903407,1944848.0,584349.0,8374210.0,,2000.0,2609.0,1238779.0,499946.0,,1.0,,38393.0,,350879.0,3130086.0,2500000.0,,,46012567,,,,28945754.0,31095.0,0,1.0,,1.0,,0.0,1
498685,,,,5.0,,,248520,0,,,2013,,1.0,,,1967.0,,248520,,,0,,,,,0,,,103390,,,103390,,0,0,20.0,0,,0,,1872,0,0,0,103390,145130,0,0,0,,,145130,,248520,0,0,0,0,,0.0,0,0,0,0,0,0,0,0,,1.0,0,,0,0,0,1.0,1,,0,1872,0,0.0,,,,0,0,0,1.0,,,0,,,,,,,,,,,,,,,248520.0,,,,248520.0,,,248520,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,145130,,,,,,,,,,,1.0,,103390.0,,103390.0,,,,103390,,,,103390.0,,1,,1.0,1.0,,0.0,1
421292,,,,4.0,,,323112,0,,1.0,2013,,1.0,,,2013.0,,323112,,,3,,,,,0,,,300,,,300,,0,0,3.0,0,,0,0.0,3,0,0,0,300,322812,0,0,0,,,322812,,323112,0,0,0,1,,0.0,0,0,0,0,0,0,0,0,,1.0,0,,0,0,0,1.0,1,,3,3,0,0.0,,,,0,0,0,1.0,0.0,0.0,0,0.0,0.0,0.0,323112.0,,,,,,,,,,,,,,,323112.0,,,323112,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,322812,0.0,0.0,322812.0,,,300.0,,,,1.0,,,,,,,,,300,,,,300.0,,0,,1.0,,,0.0,0
657761,,,,,1.0,,970126,0,,,2014,1.0,,,,2007.0,,0,,,4,9271.0,1221870.0,788763.0,,0,,832726.0,152825,53234.0,,1377650,788763.0,0,0,,0,841997.0,0,,4,0,0,20605,1377650,815401,0,949521,1900,,1221870.0,817301,,970126,0,0,0,1,0.0,0.0,0,1,0,0,0,0,0,0,,,0,,0,0,0,1.0,1,0.0,4,4,0,0.0,,,,0,0,0,,,,0,1900.0,,,,,,,,,,,,,,,,949521.0,,,,949521.0,970126,,,,,,,1900.0,1900.0,5513.0,,,1568.0,,,5442.0,,,,,,,,,,,,,,,,,817301,0.0,10238.0,807063.0,,838960.0,971180.0,,,,,1.0,,1377650.0,311755.0,335315.0,,,,1377650,,,,152825.0,,0,,1.0,,,0.0,0
923808,,,,,1.0,,467863,0,,,2015,,,,,2002.0,,9848,15374.0,15568.0,0,33253.0,340610.0,754.0,,0,,,21968,28691.0,,362083,19936.0,0,0,,0,48627.0,0,0.0,0,0,8496,27345,362083,3767,0,0,2962,3614.0,340610.0,15225,0.0,37193,0,0,0,0,,0.0,0,0,0,0,0,0,0,0,,1.0,0,,0,0,0,0.0,0,,0,0,0,0.0,1.0,,,0,0,0,1.0,,,0,0.0,0.0,0.0,9848.0,,,,,,,,,,,,,,,9848.0,,,37193,,,,,,2962.0,,2962.0,1121.0,,,,,,2546.0,,,,,,,,,,,,,,,,,15225,0.0,6729.0,8496.0,,,,,,,,1.0,,362083.0,14744.0,13109.0,,,,362083,,,,21968.0,,0,,1.0,,,0.0,0
85819,,,,3.0,,,575547,0,,,2009,,1.0,,,1962.0,,274436,255159.0,,20,18643.0,287125.0,223426.0,-1066.0,0,,300788.0,10674,29754.0,,312609,543770.0,5175,12,25.0,12733,573524.0,0,0.0,20,0,0,8371,299876,230445,27273,259190,328151,320344.0,302707.0,558596,15582.0,569270,1,0,0,1,,1.0,0,0,0,0,0,0,0,1,,1.0,0,,0,0,0,1.0,1,,20,20,0,1.0,,,,0,0,0,,0.0,0.0,0,2778.0,36800.0,0.0,81736.0,,,,6277.0,,33550.0,,,11260.0,,,,259190.0,181440.0,274436.0,,259190.0,569270,,,,,,39612.0,,39612.0,19500.0,,,,,,13098.0,,88.0,212.0,300.0,,49648.0,215196.0,264844.0,,5408.0,13039.0,18447.0,,1451.0,3497.0,4948.0,558596,5175.0,155546.0,397875.0,,,,55196.0,53925.0,,1.0,,,,225138.0,264998.0,,,,312609,,,,,,0,1.0,,1.0,,0.0,1
590869,,,,9.0,,,4762645,0,,,2012,,1.0,,,1986.0,3806259.0,0,0.0,0.0,4,1294.0,487188.0,368583.0,0.0,0,0.0,4871287.0,-451402,323823.0,,647446,4548758.0,0,4,0.0,598936,4872581.0,0,0.0,4,4453827,0,963,48510,384250,15000,4746682,375970,373916.0,1099106.0,5214047,611918.0,4762645,0,0,0,1,0.0,0.0,0,1,0,0,0,0,0,1,,1.0,0,,0,0,0,1.0,1,0.0,4,4,0,0.0,,,,0,0,0,,0.0,0.0,1,0.0,0.0,410785.0,,,,,,,,,,,,,,4746682.0,,,15000.0,4746682.0,4762645,,,,,,,,,40475.0,,,32903.0,,,,,,,39600.0,,,,279872.0,,,,26178.0,,,,30320.0,5214047,,,,,570971.0,201695.0,265685.0,244322.0,,,1.0,,48510.0,,,,,,647446,,,,-451402.0,,0,1.0,,0.0,,0.0,1


In [324]:
%%time
df.describe().T

Wall time: 15.1 s


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
F9_09_PC_FEES_FOR_SVCE_FR_TOT,435088.0,1.912814e+04,2.442569e+05,-35000.0,0.000000e+00,0.0,0.0,32764282.0
F9_00_HD_ADDR_CHANGE,74526.0,1.000000e+00,0.000000e+00,1.0,1.000000e+00,1.0,1.0,1.0
F9_00_HD_AMENDED_RETURN,16642.0,1.000000e+00,0.000000e+00,1.0,1.000000e+00,1.0,1.0,1.0
F9_00_HD_EXEMPT_STATUS_4847A1,1514.0,1.000000e+00,0.000000e+00,1.0,1.000000e+00,1.0,1.0,1.0
F9_00_HD_EXEMPT_STATUS_501C,486371.0,7.187754e+00,4.022294e+00,2.0,5.000000e+00,6.0,8.0,29.0
...,...,...,...,...,...,...,...,...
F9_12_PC_FED_GRNT_AUDIT_PERFORMD,234488.0,7.727517e-01,4.190552e-01,0.0,1.000000e+00,1.0,1.0,1.0
F9_12_PC_FED_GRNT_AUDIT_REQUIRED,1718321.0,1.058498e-01,3.076454e-01,0.0,0.000000e+00,0.0,0.0,1.0
F9_12_PC_FINCL_STMTS_AUDITED,1895016.0,4.788366e-01,4.995520e-01,0.0,0.000000e+00,0.0,1.0,1.0
501c3,1895016.0,7.574976e-01,4.285967e-01,0.0,1.000000e+00,1.0,1.0,1.0


#### Save DF

In [325]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (with parsed sub-key variables and reformatted types).pkl')

Wall time: 1min 17s
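
A quick way to confirm that the reformatted types survive the save is to reload the pickle and compare dtypes. The sketch below uses a small toy DataFrame and a temporary file rather than the real 1.9M-row file; the values and the *TOTAL_REVENUE* column name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real DataFrame (all values are made up).
toy = pd.DataFrame({
    "EIN": ["010000001", "010000002"],                           # string
    "F9_00_HD_ADDR_CHANGE": pd.array([1, None], dtype="Int64"),  # nullable integer
    "TOTAL_REVENUE": [1601333.0, float("nan")],                  # float; hypothetical column name
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Pickle stores the dtypes along with the data, so they should match after reloading.
dtypes_match = list(reloaded.dtypes) == list(toy.dtypes)
print(dtypes_match)
```

The same comparison against the real *df* would need the full file in memory twice, so it may be more practical to compare only `df.dtypes` saved before and after.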


In [10]:
concordance[:1]

Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,,TaxPeriodEndDate,,


In [12]:
#concordance = concordance.drop(columns='python_data_type')

In [9]:
new_variables_df[:1]

Unnamed: 0,variable_name_new,original_names,sub_keys,data_type_xsd,len,python_data_type
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",[nan],CheckboxType,2,Int64


In [14]:
pd.merge(concordance, new_variables_df, on='variable_name_new', how='left', indicator=True)['_merge'].value_counts()

both          384
right_only      0
left_only       0
Name: _merge, dtype: int64

In [16]:
print(len(concordance))
print(len(pd.merge(concordance, new_variables_df, on='variable_name_new', how='left', indicator=True)))
merged = pd.merge(concordance, new_variables_df[['variable_name_new', 'python_data_type']], on='variable_name_new', how='left', indicator=True)
print(len(merged))
print(merged['_merge'].value_counts())
merged = merged.drop(columns='_merge')
merged[:2]

384
384
384
both          384
right_only      0
left_only       0
Name: _merge, dtype: int64


Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,BINARIZE,MongoDB_Name,sub_key,sub_sub_key,python_data_type
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,TaxPeriodEndDate,,,string
1,/Return/ReturnHeader/TaxPeriodEndDt,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,TaxPeriodEndDt,,,string
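
The length and *_merge* checks above can also be enforced by the merge itself. The following is a minimal sketch on toy data, assuming the lookup table has exactly one row per *variable_name_new*; the `validate` argument of `pd.merge` raises an error if that assumption fails. The third xpath is a placeholder, not a real concordance entry.

```python
import pandas as pd

# Toy stand-ins: several xpaths can map to one variable name in the concordance,
# while the lookup table should have exactly one row per variable name.
concordance_toy = pd.DataFrame({
    "xpath": ["/Return/ReturnHeader/TaxPeriodEndDate",
              "/Return/ReturnHeader/TaxPeriodEndDt",
              "/hypothetical/path/AddressChange"],  # placeholder xpath
    "variable_name_new": ["F9_00_HD_TAX_PER_END", "F9_00_HD_TAX_PER_END",
                          "F9_00_HD_ADDR_CHANGE"],
})
lookup_toy = pd.DataFrame({
    "variable_name_new": ["F9_00_HD_TAX_PER_END", "F9_00_HD_ADDR_CHANGE"],
    "python_data_type": ["string", "Int64"],
})

# validate="many_to_one" checks that the merge key is unique in the right table.
merged_toy = pd.merge(concordance_toy, lookup_toy, on="variable_name_new",
                      how="left", validate="many_to_one", indicator=True)
all_matched = bool((merged_toy["_merge"] == "both").all())
merged_toy = merged_toy.drop(columns="_merge")

# A duplicated key in the lookup table would silently add rows without validate.
dup_lookup = pd.concat([lookup_toy, lookup_toy.iloc[[0]]], ignore_index=True)
try:
    pd.merge(concordance_toy, dup_lookup, on="variable_name_new",
             how="left", validate="many_to_one")
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True

print(len(merged_toy), all_matched, duplicate_caught)
```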


In [18]:
merged['python_data_type'].value_counts()

Int64       354
string       27
DateTime      3
Name: python_data_type, dtype: int64

#### Save concordance file with *python_data_type*

In [21]:
print(merged.columns.tolist())

['xpath', 'project', 'variable_name_new', '# of Characters (newly named)', 'variable name notes', 'PARSING NOTES', 'OTHER NOTES', 'description', 'location_code', 'part', 'data_type_xsd', 'BINARIZE', 'MongoDB_Name', 'sub_key', 'sub_sub_key', 'python_data_type']


In [22]:
merged = merged[['xpath', 'project', 'variable_name_new', '# of Characters (newly named)', 'variable name notes', 
                 'PARSING NOTES', 'OTHER NOTES', 'description', 'location_code', 'part', 'data_type_xsd',
                 'python_data_type', 'BINARIZE', 'MongoDB_Name', 'sub_key', 'sub_sub_key']]
merged[:1]

Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,string,,TaxPeriodEndDate,,
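
Typing out all 16 column names works but is easy to get wrong if the concordance gains a column. As an alternative, here is a small sketch of a helper that moves one column to sit right after another. `move_column_after` is a hypothetical helper written for this note, not a pandas function, and the toy frame only has a few of the concordance columns.

```python
import pandas as pd

def move_column_after(frame, column, anchor):
    """Return a new DataFrame with `column` placed right after `anchor`."""
    cols = [c for c in frame.columns if c != column]
    cols.insert(cols.index(anchor) + 1, column)
    return frame[cols]

# Toy frame with python_data_type at the end, as it is right after the merge.
toy = pd.DataFrame({
    "xpath": ["/Return/ReturnHeader/TaxPeriodEndDate"],
    "data_type_xsd": ["DateType"],
    "BINARIZE": [None],
    "python_data_type": ["string"],
})

reordered = move_column_after(toy, "python_data_type", "data_type_xsd")
print(reordered.columns.tolist())
# ['xpath', 'data_type_xsd', 'python_data_type', 'BINARIZE']
```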


### Look at 501(c)(3)s
This section is not needed here.

In [98]:
print('# of columns:', len(df.columns))
print('# of observations:', len(df))

# of columns: 200
# of observations: 1895016


In [99]:
df['501c3'].value_counts()

1    1435470
0     459546
Name: 501c3, dtype: int64

In [101]:
print(len(df[df['501c3']==1]))

1435470


#### Create and save list of EINs for BMF File
In the previous round I believe there were only 296,334 EINs (though that may have been only for valid BMF EINs).

In [10]:
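# EINs for all filings flagged as 501(c)(3); an EIN appears once per filing
ein_list = df[df['501c3']==1]['EIN'].tolist()
print(len(ein_list))
print(len(set(ein_list)))
# Keep only the unique EINs
ein_list = list(set(ein_list))
print(len(ein_list))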

1435470
257049
257049


In [11]:
import json
with open('ein_list_501c3.json', 'w') as fp:
    json.dump(ein_list, fp)
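
One thing to keep in mind: `list(set(...))` does not guarantee an order, and if the EINs are stored as strings the order can differ from run to run. The sketch below, on toy data with made-up EINs, sorts the unique values before saving so the JSON file is reproducible, then reads it back as a check. It assumes *EIN* is a string column; the file name is a placeholder.

```python
import json
import os
import tempfile

import pandas as pd

# Toy data: one EIN files twice as a 501(c)(3), one EIN is not a 501(c)(3).
toy = pd.DataFrame({
    "EIN": ["010000002", "010000001", "010000002", "010000003"],  # made-up EINs
    "501c3": [1, 1, 1, 0],
})

# Unique EINs among 501(c)(3) filings, sorted so the saved file is stable across runs.
unique_eins = sorted(toy.loc[toy["501c3"] == 1, "EIN"].unique().tolist())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ein_list_toy.json")  # placeholder file name
    with open(path, "w") as fp:
        json.dump(unique_eins, fp)
    with open(path) as fp:
        reloaded = json.load(fp)

print(len(unique_eins), reloaded == unique_eins)
```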