# Overview

Read in DF: `D:/990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee.feather`
- It was built in this notebook: `IRS 990 e-File Data -- IRS Files (8c2) -- Fix Industry Variables (Jesse's method).ipynb`

Note that state and COUNTY_CODE were fixed in this notebook: `IRS 990 e-File Data -- IRS Files (8b) -- Fix State Variable and create COUNTY_CODE.ipynb`
- In that notebook I noted that:
  - Fixed the state variable and created `COUNTY_CODE` from `BMF_CENSUS_BLOCK_FIPS`
  - The state variable to use is `BMF_F990_ORG_ADDR_STATE`
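
A minimal sketch of how a county code can be read off a census block FIPS, assuming the block FIPS is stored as a 15-character string (2 state digits, 3 county digits, 6 tract digits, 4 block digits). The helper name `county_from_block_fips` and the column `COUNTY_CODE_SKETCH` are hypothetical, and this is not necessarily how the earlier notebook built `COUNTY_CODE`. The first value matches the first row displayed later in this notebook (block FIPS `230310360032023`, county code `23031`).

```python
import pandas as pd

# Toy frame; the first block FIPS is the one shown in this notebook's first displayed row.
toy = pd.DataFrame(
    {"BMF_CENSUS_BLOCK_FIPS": ["230310360032023", None, "060014001001000"]}
)


def county_from_block_fips(block_fips: pd.Series) -> pd.Series:
    """Return the 5-character state+county prefix of a 15-digit block FIPS.

    Assumes the input holds strings (or missing values), not floats.
    """
    # Keep everything as strings so leading zeros (e.g. '06') survive.
    s = block_fips.astype("string").str.strip().str.zfill(15)
    return s.str[:5]


toy["COUNTY_CODE_SKETCH"] = county_from_block_fips(toy["BMF_CENSUS_BLOCK_FIPS"])
print(toy)
```

Missing block FIPS values stay missing rather than turning into a placeholder code.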

In this notebook I then do the following:
- Rationalize City
  - I end up using the 990 variable `F9_00_HD_FILER_CITY_US`
- Rationalize ZIP variable
  - I end up using the 990 variable `F9_00_HD_FILER_ZIP_US` and creating a five-character version `ZIP5`
- Rationalize Street Address
  - I end up using the 990 variable `F9_00_HD_FILER_ADDR_US_L1`
- Replace some `COUNTY_CODE` values of '00000' with `np.nan`
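
As a rough illustration of the last step, the sketch below turns the '00000' placeholder into a missing value on a toy frame. It replaces every '00000'; which rows the notebook actually replaces is not shown here, so treat the blanket rule as an assumption.

```python
import pandas as pd

# Toy data: two real county codes and two '00000' placeholders.
toy = pd.DataFrame({"COUNTY_CODE": ["23031", "00000", "06001", "00000"]})

# mask() sets the value to NaN wherever the condition is True.
toy["COUNTY_CODE"] = toy["COUNTY_CODE"].mask(toy["COUNTY_CODE"] == "00000")

n_missing = int(toy["COUNTY_CODE"].isna().sum())
print(n_missing)
```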

The final geo variables to use are collected in `geo_cols`:
- `BMF_F990_ORG_ADDR_STATE`
- `COUNTY_CODE`
- `F9_00_HD_FILER_CITY_US`
- `ZIP5`
- `F9_00_HD_FILER_ADDR_US_L1`


I then save the dataset in various formats (here is the feather version): 
- `D:/990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee_zip.feather`

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

2.2.2


In [3]:
from platform import python_version
print(python_version())

3.10.11


In [4]:
# Display options: http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

In [5]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [6]:
# Note: this overrides the float format set in the previous cell
pd.options.display.float_format = '{:,.2f}'.format

In [7]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

#### Set working directory

In [8]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read in DF

In [9]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_feather('D:/990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee.feather')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

Current date and time :  2025-06-26 23:03:39 

# of columns: 361
# of observations: 2598477
CPU times: total: 6min 22s
Wall time: 1min 6s


Unnamed: 0,EIN,F9_00_HD_TAX_YEAR,_id,OrganizationName,URL,DLN,TaxPeriod,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_00_HD_BUILD_TIME_STAMP,fiscal_year,Name,NameControl,Phone,USAddress,ForeignAddress,InCareOfName,BusinessName,BusinessNameControlTxt,PhoneNum,InCareOfNm,ForeignPhoneNum,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_BEGIN,F9_00_HD_TAX_PER_END,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_03_PC_PGMSVC_SIGNIF_CHG,F9_03_
PC_PGMSVC_SIGNIF_NEW,F9_03_PC_PROG_SVC_ACC_1_CODE,F9_03_PC_PROG_SVC_ACC_1_DESC,F9_03_PC_PROG_SVC_ACC_1_EXP,F9_03_PC_PROG_SVC_ACC_1_GRNT,F9_03_PC_PROG_SVC_ACC_1_REV,F9_03_PC_PROG_SVC_ACC_2_CODE,F9_03_PC_PROG_SVC_ACC_2_DESC,F9_03_PC_PROG_SVC_ACC_2_EXP,F9_03_PC_PROG_SVC_ACC_2_GRNT,F9_03_PC_PROG_SVC_ACC_2_REV,F9_03_PC_PROG_SVC_ACC_3_CODE,F9_03_PC_PROG_SVC_ACC_3_DESC,F9_03_PC_PROG_SVC_ACC_3_EXP,F9_03_PC_PROG_SVC_ACC_3_GRNT,F9_03_PC_PROG_SVC_ACC_3_REV,F9_03_PC_TOT_OTH_PROG_SVC_EXP,F9_03_PC_TOT_OTH_PROG_SVC_GRNT,F9_03_PC_TOT_OTH_PROG_SVC_REV,F9_03_PC_TOT_PROG_SVC_EXPENSE,F9_03_PZ_MISSION_DESCRIPTION,F9_03_PZ_SCHEDULE_O_PART3,F9_04_PC_ACTVITIES_VIA_PARTNER,F9_04_PC_CONTROLLED_ENTITY,F9_04_PC_DISREGARDED_ENTITY,F9_04_PC_EXCESS_BENEFIT_TRANS,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_LOBBYING_ACTIVITIES,F9_04_PC_POLITICAL_ACTIVITIES,F9_04_PC_PRIOR_EXCESS_BEN_TRAN,F9_04_PC_PROF_FR_EXP_GT_15K,F9_04_PC_RELATED_ENTITY,F9_04_PC_TRANS_TO_CNTRLD_ENT,F9_04_PC_TRANS_WITH_CNTRLD_ENT,F9_05_EXP_SCHED_O_X,F9_05_PC_NUMBER_EMPLOYEES_W3,F9_05_PC_NUMBER_FORMS_1096,F9_05_PC_UNRELATED_BUS_INCOME,F9_06_EXP_SCHED_O_X,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET
_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_EXP_SCHED_O_X,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_EXP_SCHED_O_X,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_EXP_AD_PROMO_TOT,F9_09_EXP_BENF_PAID_MEMB_TOT,F9_09_EXP_CONF_MEETING_TOT,F9_09_EXP_DEPREC_FUNDR,F9_09_EXP_DEPREC_MAG,F9_09_EXP_DEPREC_PROG,F9_09_EXP_DEPREC_TOT,F9_09_EXP_GRANT_FRGN_TOT,F9_09_EXP_GRANT_INDIV_DMSTC_TOT,F9_09_EXP_GRANT_ORG_DMSTC_TOT,F9_09_EXP_INFO_TECH_TOT,F9_09_EXP_INSURANCE_TOT,F9_09_EXP_INTEREST_TOT,F9_09_EXP_JOINT_COSTS_TOT,F9_09_EXP_OCCUPANCY_TOT,F9_09_EXP_OFFICE_TOT,F9_09_EXP_OTH_OTH_TOT,F9_09_EXP_ROY_TOT,F9_09_EXP_SCHED_O_X,F9_09_EXP_TRAVEL_ENTRTNMNT_TOT,F9_09_EXP_TRAVEL_TOT,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09
_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYMENT_TO_AFFILIATES,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_ASSETS_ACC_NET_EOY,F9_10_ASSETS_EXP_PREPAID_EOY,F9_10_ASSETS_INTANGIB_EOY,F9_10_ASSETS_INVENT_SALE_EOY,F9_10_ASSETS_LESS_DEPREC_EOY,F9_10_ASSETS_LOANS_DISQUAL_EOY,F9_10_ASSETS_NOTES_LOANS_NET_EOY,F9_10_ASSETS_OTH_EOY,F9_10_ASSETS_PLEDGES_NET_EOY,F9_10_LIAB_ACC_PAYABLE_EOY,F9_10_LIAB_GRANTS_PAYABLE_EOY,F9_10_LIAB_LOANS_OFF_EOY,F9_10_LIAB_REV_DEFERRED_EOY,F9_10_NAFB_RESTRICT_PERM_EOY,F9_10_NAFB_RESTRICT_TEMP_EOY,F9_10_NAFB_UNRESTRICT_EOY,F9_10_PC_BOND_LIABILITY_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_ESCROW_LIABILITY_EOY,F9_10_PC_INVEST_OTHER_SEC_EOY,F9_10_PC_INVEST_PROG_RELTD_EOY,F9_10_PC_INVEST_PUB_TRADED_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_SECURE_MORT_NOTES_EOY,F9_10_PC_UNSECURED_LOANS_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_10_SCHED_O_X,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_11_SCHED_O_X,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_A
UDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,F9_12_SCHED_O_X,number_of_other_prog_svces,501c3,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN,F9_00_HD_FILER_STATE_US,F9_00_HD_TIME_STAMP_yr,ein_int,BMF_EIN2,BMF_EIN,BMF_NTEE_IRS,BMF_NTEE_NCCS,BMF_NTEEV2,BMF_NCCS_LEVEL_1,BMF_NCCS_LEVEL_2,BMF_NCCS_LEVEL_3,BMF_F990_TOTAL_REVENUE_RECENT,BMF_F990_TOTAL_INCOME_RECENT,BMF_F990_TOTAL_ASSETS_RECENT,BMF_F990_ORG_ADDR_CITY,BMF_F990_ORG_ADDR_STATE,BMF_F990_ORG_ADDR_ZIP,BMF_F990_ORG_ADDR_STREET,BMF_CENSUS_CBSA_FIPS,BMF_CENSUS_CBSA_NAME,BMF_CENSUS_BLOCK_FIPS,BMF_CENSUS_URBAN_AREA,BMF_CENSUS_STATE_ABBR,BMF_CENSUS_COUNTY_NAME,BMF_ORG_ADDR_FULL,BMF_ORG_ADDR_MATCH,BMF_LATITUDE,BMF_LONGITUDE,BMF_GEOCODER_SCORE,BMF_GEOCODER_MATCH,BMF_BMF_SUBSECTION_CODE,BMF_BMF_STATUS_CODE,BMF_BMF_PF_FILING_REQ_CODE,BMF_BMF_ORGANIZATION_CODE,BMF_BMF_INCOME_CODE,BMF_BMF_GROUP_EXEMPT_NUM,BMF_BMF_FOUNDATION_CODE,BMF_BMF_FILING_REQ_CODE,BMF_BMF_DEDUCTIBILITY_CODE,BMF_BMF_CLASSIFICATION_CODE,BMF_BMF_ASSET_CODE,BMF_BMF_AFFILIATION_CODE,BMF_ORG_RULING_DATE,BMF_ORG_FISCAL_YEAR,BMF_ORG_RULING_YEAR,BMF_ORG_YEAR_FIRST,BMF_ORG_YEAR_LAST,BMF_ORG_YEAR_COUNT,BMF_ORG_PERS_ICO,BMF_ORG_NAME_SEC,BMF_ORG_NAME_CURRENT,BMF_ORG_FISCAL_PERIOD,filing_year_had_duplicate,COUNTY_CODE,NTEE,NTEE_MAJ12,NTEE_MAJ12_EV
0,10017496,2022,65c1a1d52a9ba8ce45342904,,https://s3.amazonaws.com/irs-form-990/202323189349305317_public.xml,,,0,2023-04-26 12:10:37+00:00,2022,,,,"{'AddressLine1': None, 'AddressLine1Txt': 'PO BOX 534', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'YORK HARBOR', 'State': None, 'StateAbbreviationCd': 'ME', 'ZIPCd': '03911', 'ZIPCode': None}",,,{'BusinessNameLine1Txt': 'AGAMENTICUS YACHT CLUB INC'},AGAM,2073638510,,,0,0,,0,,1,0,,376800,0,0,0,DANIEL FORD,2023-11-13,,ME,2022-01-01,2022-12-31,2023-11-14 16:30:26+00:00,0,1,0,,0,WWW.AYCSAIL.ORG,1937,0,279970,0,0,13,0,273331,0,0,0,0,0,184620,0,0,413907,0,2744,16,20,8282,0,0,0,13,0,0,34628,405625,"THE ORGANIZATION'S PRIMARY EXEMPT PURPOSE IS TO TEACH SAILING TO CHILDREN BY FOCUSING ON SAFETY, ENJOYMENT AND KNOWLEDGE OF SAILING.",132172,3377,54843,56026,0,273331,188198,0,372818,0,0,,"PROVIDES SAILING INSTRUCTION, SEAMAN-SHIP AND WATER SAFETY SKILLS TO CHILDREN.",167950,0,54843,,,0,0,0,,,0,0,0,0,0,0,167950,"THE ORGANIZATION'S PRIMARY EXEMPT PURPOSE IS TO TEACH YOUNGSTERS THE BASICS OF SAILING, SEAMAN-SHIP AND SAFE CONDUCT ON THE WATER. 
IT IS THE ORGANIZATION'S MISSION TO CREATE AND SUSTAIN A COMMUNITY OF FAMILIES WHO ENJOY BEING ON THE WATER.",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,2,0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,1,0,13,13,0,0,0,0,0,,0,0,0,0,1,0,0,0,0,0,0,0,236965,0,0,0,0,0,0,0,0,0,3377,43005,0,54843,0,279970,0,54843,372818,0,0,0,0,0,16519,16519,0,0,0,84,21088,0,0,2776,1000,28263,0,1,0,0,0,0,0,0,0,0,0,0,6748,0,0,0,0,0,0,0,0,0,0,0,52045,52045,0,0,0,3981,3981,0,0,0,0,188198,2744,17504,167950,0,0,30000,0,174902,0,0,0,0,682,0,0,7600,0,0,0,0,29306,125682,0,0,0,83323,453817,278915,0,0,0,0,0,0,0,0,0,0,0,0,413907,0,0,0,0,184620,-52326,0,0,1,0,,0,0,0,0,0,,1,PO BOX 534,,YORK HARBOR,3911,,ME,2023,10017496,EIN-01-0017496,10017496,N50,N50,HMS-N50-RG,501C3 CHARITY,O,HS,372818.0,376800.0,413907.0,YORK HARBOR,ME,03911-0534,PO BOX 534,38860,"Portland-South Portland, ME",230310360032023,U,ME,York County,"PO BOX 534,YORK HARBOR,ME,03911-0534","03911-0534, York Harbor, Maine",43.13,-70.64,98.0,M,3.0,1.0,0.0,1.0,4.0,0.0,15.0,1.0,1.0,2000.0,4.0,3.0,1993-03,2024.0,1993.0,1995.0,2024.0,30.0,,,AGAMENTICUS YACHT CLUB OF YORK,3.0,0,23031,N50,HU,HMS


In [19]:
print(len(df))
print(len(set(df['EIN'].tolist())))
print(df['501c3'].value_counts())

2598477
351875
501c3
1    2598477
Name: count, dtype: int64


# Inspect Address Columns

#### The two finalized variables from prior notebook

In [29]:
geo_cols = ['COUNTY_CODE', 'BMF_F990_ORG_ADDR_STATE']
df[geo_cols].info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2598477 entries, 0 to 2598476
Data columns (total 2 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   COUNTY_CODE              2598477 non-null  object
 1   BMF_F990_ORG_ADDR_STATE  2597278 non-null  object
dtypes: object(2)
memory usage: 39.6+ MB


In [31]:
df[geo_cols+['F9_00_HD_FILER_STATE_US']].isna().sum()

COUNTY_CODE                   0
BMF_F990_ORG_ADDR_STATE    1199
F9_00_HD_FILER_STATE_US    3155
dtype: int64

In [36]:
print(len(df[(df['BMF_F990_ORG_ADDR_STATE'].isnull())&(df['F9_00_HD_FILER_STATE_US'].isnull())]))
print(len(df[(df['BMF_F990_ORG_ADDR_STATE'].isnull())&(df['F9_00_HD_FILER_STATE_US'].notnull())]))

1199
0


### Now turn to the other variables

In [None]:
bmf_cols = [c for c in df.columns if 'BMF_' in c]
bmf_cols

In [20]:
[c for c in df.columns if 'ADD' in c.upper()]

['USAddress',
 'ForeignAddress',
 'F9_00_HD_ADDR_CHANGE',
 'F9_06_PC_OFFICER_MAILING_ADDRESS',
 'F9_00_HD_FILER_ADDR_US_L1',
 'F9_00_HD_FILER_ADDR_US_L2',
 'BMF_F990_ORG_ADDR_CITY',
 'BMF_F990_ORG_ADDR_STATE',
 'BMF_F990_ORG_ADDR_ZIP',
 'BMF_F990_ORG_ADDR_STREET',
 'BMF_ORG_ADDR_FULL',
 'BMF_ORG_ADDR_MATCH']

###### These two variables contain the full address from the 990; they have already been parsed and can likely be deleted
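
For reference, here is a minimal sketch of how one `USAddress` value could be split into components. It assumes each value is either a dict or the string form of a dict with the keys visible in the output in this notebook (`AddressLine1Txt`/`AddressLine1`, `CityNm`/`City`, `StateAbbreviationCd`/`State`, `ZIPCd`/`ZIPCode`). The function `parse_us_address` and the output column names are hypothetical; this is not the parsing code used upstream.

```python
import ast

import pandas as pd

toy = pd.DataFrame({"USAddress": [
    "{'AddressLine1': None, 'AddressLine1Txt': 'PO BOX 534', "
    "'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, "
    "'CityNm': 'YORK HARBOR', 'State': None, 'StateAbbreviationCd': 'ME', "
    "'ZIPCd': '03911', 'ZIPCode': None}",
    None,
]})


def parse_us_address(value):
    """Return (street, city, state, zip) for one USAddress value."""
    if value is None or (isinstance(value, float) and pd.isna(value)):
        return (None, None, None, None)
    # Accept either a dict or its string representation.
    d = ast.literal_eval(value) if isinstance(value, str) else value
    # Each component appears under two alternative keys; take whichever is filled.
    street = d.get("AddressLine1Txt") or d.get("AddressLine1")
    city = d.get("CityNm") or d.get("City")
    state = d.get("StateAbbreviationCd") or d.get("State")
    zip_code = d.get("ZIPCd") or d.get("ZIPCode")
    return (street, city, state, zip_code)


parsed = pd.DataFrame(
    toy["USAddress"].map(parse_us_address).tolist(),
    columns=["street", "city", "state", "zip"],
    index=toy.index,
)
print(parsed)
```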

In [21]:
df[['USAddress', 'ForeignAddress']].sample(5)

Unnamed: 0,USAddress,ForeignAddress
147127,"{'AddressLine1': None, 'AddressLine1Txt': '43-50 MAIN STREET', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'FLUSHING', 'State': None, 'StateAbbreviationCd': 'NY', 'ZIPCd': '11355', 'ZIPCode': None}",
1548155,"{'AddressLine1': None, 'AddressLine1Txt': '550 JUSTISON STREET', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'WILMINGTON', 'State': None, 'StateAbbreviationCd': 'DE', 'ZIPCd': '19801', 'ZIPCode': None}",
420863,"{'AddressLine1': None, 'AddressLine1Txt': '807 CAMP HORNE ROAD', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'PITTSBURGH', 'State': None, 'StateAbbreviationCd': 'PA', 'ZIPCd': '15327', 'ZIPCode': None}",
1972326,"{'AddressLine1': None, 'AddressLine1Txt': '843 CAMP STREET', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'NEW ORLEANS', 'State': None, 'StateAbbreviationCd': 'LA', 'ZIPCd': '701303751', 'ZIPCode': None}",
2358727,"{'AddressLine1': None, 'AddressLine1Txt': 'PO Box 820023', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'Portland', 'State': None, 'StateAbbreviationCd': 'OR', 'ZIPCd': '97282', 'ZIPCode': None}",


### Rationalize City
Note that whenever `F9_00_HD_FILER_CITY_US` is missing but `BMF_F990_ORG_ADDR_CITY` is not (3,143 filings), the filing has a foreign address. So I think we can just use the 990 variable `F9_00_HD_FILER_CITY_US`.
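
The same missingness comparison recurs for city, ZIP, and street address, so it can be wrapped in a small helper. This is only a sketch on toy data; `missing_overlap` is a hypothetical name and not a function defined elsewhere in this notebook.

```python
import pandas as pd


def missing_overlap(df: pd.DataFrame, a: str, b: str) -> pd.DataFrame:
    """Cross-tabulate whether column `a` is missing against whether `b` is."""
    labels = {True: "missing", False: "present"}
    return pd.crosstab(
        df[a].isna().map(labels).rename(a),
        df[b].isna().map(labels).rename(b),
    )


# Toy data covering all four combinations once.
toy = pd.DataFrame({
    "F9_00_HD_FILER_CITY_US": ["HUGO", None, "CARY", None],
    "BMF_F990_ORG_ADDR_CITY": ["HUGO", "CANADA", None, None],
})

table = missing_overlap(toy, "F9_00_HD_FILER_CITY_US", "BMF_F990_ORG_ADDR_CITY")
print(table)
```

One call gives the four counts that the separate `len(df[...])` lines print one at a time.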

Note also that in the sample below, the 990 city is 'WISCONSIN RAPIDS' but the BMF city is 'WISC RAPIDS'
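
To get a feel for how often the two city variables differ for reasons other than case or stray whitespace, a comparison after light normalization might look like the sketch below. The helper `normalize_city` is hypothetical, and the toy values are taken from the samples in this section; abbreviations such as 'WISC RAPIDS' still count as disagreements.

```python
import pandas as pd

toy = pd.DataFrame({
    "F9_00_HD_FILER_CITY_US": ["Kansas City", "WISCONSIN RAPIDS", " HUGO ", None],
    "BMF_F990_ORG_ADDR_CITY": ["KANSAS CITY", "WISC RAPIDS", "HUGO", "CANADA"],
})


def normalize_city(s: pd.Series) -> pd.Series:
    """Upper-case, trim, and collapse repeated whitespace."""
    return (
        s.astype("string")
        .str.upper()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )


f990 = normalize_city(toy["F9_00_HD_FILER_CITY_US"])
bmf = normalize_city(toy["BMF_F990_ORG_ADDR_CITY"])

both = f990.notna() & bmf.notna()          # rows where both cities are present
agree = (f990 == bmf).fillna(False)        # missing comparisons count as no match
n_disagree = int((both & ~agree).sum())    # present in both but still different
print(n_disagree)
```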

In [37]:
[c for c in df.columns if 'CITY' in c.upper()]

['F9_00_HD_FILER_CITY_US', 'BMF_F990_ORG_ADDR_CITY']

In [48]:
df[['F9_00_HD_FILER_CITY_US', 'BMF_F990_ORG_ADDR_CITY']].info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2598477 entries, 0 to 2598476
Data columns (total 2 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   F9_00_HD_FILER_CITY_US  2595322 non-null  object
 1   BMF_F990_ORG_ADDR_CITY  2594238 non-null  object
dtypes: object(2)
memory usage: 39.6+ MB


In [46]:
df[['F9_00_HD_FILER_CITY_US', 'BMF_F990_ORG_ADDR_CITY']].sample(5)

Unnamed: 0,F9_00_HD_FILER_CITY_US,BMF_F990_ORG_ADDR_CITY
2250894,HUGO,HUGO
685358,KANSAS CITY,KANSAS CITY
410232,WISCONSIN RAPIDS,WISC RAPIDS
749310,LAUDERHILL,LAUDERHILL
1677510,CARY,CARY


In [45]:
print(len(df[(df['F9_00_HD_FILER_CITY_US'].isnull())&(df['BMF_F990_ORG_ADDR_CITY'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_CITY_US'].notnull())&(df['BMF_F990_ORG_ADDR_CITY'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_CITY_US'].isnull())&(df['BMF_F990_ORG_ADDR_CITY'].notnull())]))
print(len(df[(df['F9_00_HD_FILER_CITY_US'].isnull())&(df['BMF_F990_ORG_ADDR_CITY'].notnull())&(df['ForeignAddress'].notnull())]))

12
4227
3143
3143


In [40]:
df[(df['F9_00_HD_FILER_CITY_US'].isnull())&(df['BMF_F990_ORG_ADDR_CITY'].isnull())][['F9_00_HD_FILER_CITY_US', 'BMF_F990_ORG_ADDR_CITY']].sample(5)

Unnamed: 0,F9_00_HD_FILER_CITY_US,BMF_F990_ORG_ADDR_CITY
2392544,,
2590389,,
2590388,,
2392541,,
1073329,,


In [41]:
df[(df['F9_00_HD_FILER_CITY_US'].notnull())&(df['BMF_F990_ORG_ADDR_CITY'].isnull())][['F9_00_HD_FILER_CITY_US', 'BMF_F990_ORG_ADDR_CITY']].sample(5)

Unnamed: 0,F9_00_HD_FILER_CITY_US,BMF_F990_ORG_ADDR_CITY
2474359,Lincoln,
2452241,OVERLAND PARK,
716274,FAIRVIEW,
1950374,Portland,
2385821,Farmingville,


In [43]:
df[(df['F9_00_HD_FILER_CITY_US'].isnull())&(df['BMF_F990_ORG_ADDR_CITY'].notnull())][['F9_00_HD_FILER_CITY_US', 'BMF_F990_ORG_ADDR_CITY']].sample(5)

Unnamed: 0,F9_00_HD_FILER_CITY_US,BMF_F990_ORG_ADDR_CITY
895640,,COTE D'IVOIRE
1391199,,CANADA
582605,,MEXICO
2162726,,ISRAEL
2588944,,CANADA


##### Add `F9_00_HD_FILER_CITY_US` to the list of final geo variables

In [49]:
geo_cols = ['BMF_F990_ORG_ADDR_STATE', 'COUNTY_CODE', 'F9_00_HD_FILER_CITY_US']
df[geo_cols].info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2598477 entries, 0 to 2598476
Data columns (total 3 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   BMF_F990_ORG_ADDR_STATE  2597278 non-null  object
 1   COUNTY_CODE              2598477 non-null  object
 2   F9_00_HD_FILER_CITY_US   2595322 non-null  object
dtypes: object(3)
memory usage: 59.5+ MB


In [50]:
df[geo_cols].sample(5)

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US
2430699,WA,53073,NOOKSACK
2353692,AZ,4013,PHOENIX
1267272,MO,29095,Kansas City
2368162,UT,49011,CENTERVILLE
1325525,CA,6001,BERKELEY


### Rationalize ZIP

In [55]:
zip_cols = [c for c in df.columns if 'zip' in c.lower()]  # reused in later cells
zip_cols

['F9_00_HD_FILER_ZIP_US', 'BMF_F990_ORG_ADDR_ZIP']

In [57]:
print(len(df[df['F9_00_HD_FILER_ZIP_US'].isnull()]))
print(len(df[df['F9_00_HD_FILER_ZIP_US'].notnull()]))

3155
2595322


In [15]:
df[df['F9_00_HD_FILER_ZIP_US'].isnull()][address_cols].sample(5)

Unnamed: 0,F9_00_HD_ADDR_CHANGE,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,F9_00_HD_FILER_COUNTRY_FRGN,F9_00_HD_FILER_STATE_US,BMF_CITY,BMF_ZIP5,BMF_FIPS
1921771,0.0,,,,,,RQ,,SAN JUAN,926.0,72127.0
644477,0.0,,,0.0,,,IT,,ITALY,0.0,
2293949,0.0,CA,,,,,CA,,CANADA,0.0,
1921694,0.0,,,,,,RQ,,BAYAMON,960.0,72021.0
1921610,0.0,,,,,,RQ,,PONCE,732.0,72113.0


In [16]:
df[df['F9_00_HD_FILER_ZIP_US'].isnull()]['F9_00_HD_CTRY_OF_DOMICILE'].value_counts().sum()

1236

In [58]:
df[df['F9_00_HD_FILER_ZIP_US'].isnull()]['F9_00_HD_CTRY_OF_DOMICILE'].value_counts()

F9_00_HD_CTRY_OF_DOMICILE
CA    556
IS    209
UK    194
SZ     44
FR     44
RQ     42
IT     42
KE     38
MX     28
AU     21
CS     17
NL     16
RP     16
JA     15
SW     14
PE     13
SN     12
AE     11
CH     10
CE     10
GR      9
NI      9
EZ      9
IV      8
GT      8
ID      8
NZ      8
OC      8
LS      7
VQ      7
KR      6
PL      4
GM      4
AC      4
HA      4
SE      4
TW      4
IN      3
BN      3
NP      2
AS      2
JM      2
SF      1
UG      1
BR      1
KS      1
BM      1
PS      1
CG      1
ET      1
Name: count, dtype: int64

<br>Every observation missing `F9_00_HD_FILER_ZIP_US` (3,155) has a foreign address; only four observations with a non-missing `F9_00_HD_FILER_ZIP_US` also have a foreign address

In [65]:
pd.crosstab(df['F9_00_HD_FILER_ZIP_US'].isnull(), df['ForeignAddress'].notnull())

ForeignAddress,False,True
F9_00_HD_FILER_ZIP_US,Unnamed: 1_level_1,Unnamed: 2_level_1
False,2595318,4
True,0,3155


In [66]:
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].isnull())&(df['BMF_F990_ORG_ADDR_ZIP'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].notnull())&(df['BMF_F990_ORG_ADDR_ZIP'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].isnull())&(df['BMF_F990_ORG_ADDR_ZIP'].notnull())]))
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].isnull())&(df['BMF_F990_ORG_ADDR_ZIP'].notnull())&(df['ForeignAddress'].notnull())]))

12
4227
3143
3143


In [59]:
pd.crosstab(df['F9_00_HD_FILER_ZIP_US'].isnull(), df['F9_00_HD_CTRY_OF_DOMICILE'])

F9_00_HD_CTRY_OF_DOMICILE,AC,AE,AF,AL,AM,AR,AS,AU,AX,BD,BM,BN,BR,CA,CB,CE,CG,CH,CO,CQ,CS,CT,DR,ET,EZ,FJ,FM,FR,GB,GM,GR,GT,HA,HO,ID,IN,IS,IT,IV,JA,JM,KE,KR,KS,LA,LS,MA,MD,MN,MX,NH,NI,NL,NP,NZ,OC,PA,PE,PK,PL,PO,PS,PU,RO,RP,RQ,RS,SE,SF,SG,SN,SW,SZ,TW,TX,UG,UK,UP,UY,VC,VI,VM,VQ,WA,WI
F9_00_HD_FILER_ZIP_US,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1
False,1,4,3,5,6,4,0,0,1,22,1,0,1,36,10,4,0,0,1,13,7,1,1,0,0,1,1,0,2,19,1,16,4,11,0,9,23,0,0,27,0,12,0,0,7,0,3,2,4,9,1,5,0,4,0,3,8,4,3,0,4,1,1,1,0,54,5,0,15,1,7,0,45,0,4,11,39,4,4,8,15,5,25,10,14
True,4,11,0,0,0,0,2,21,0,0,1,3,1,556,0,10,1,10,0,0,17,0,0,1,9,0,0,44,0,4,9,8,4,0,8,3,209,42,8,15,2,38,6,1,0,7,0,0,0,28,0,9,16,2,8,8,0,13,0,4,0,1,0,0,16,42,0,4,1,0,12,14,44,4,0,1,194,0,0,0,0,0,7,0,0


<br>Based on the above I think we can just use `F9_00_HD_FILER_ZIP_US`
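
A vectorized way to take the first five characters is the `.str` accessor, sketched here on toy values drawn from the `USAddress` samples above. It assumes the ZIP is stored as a string; missing values stay missing.

```python
import pandas as pd

toy = pd.DataFrame(
    {"F9_00_HD_FILER_ZIP_US": ["03911", "701303751", "97282", None]}
)

# Slice each string to its first five characters; None/NaN is passed through.
toy["ZIP5"] = toy["F9_00_HD_FILER_ZIP_US"].str[:5]
print(toy)
```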

In [68]:
df['F9_00_HD_FILER_ZIP_US'].apply(lambda x: len(str(x))).value_counts()

F9_00_HD_FILER_ZIP_US
5    2297118
9     298204
4       3155
Name: count, dtype: int64

In [69]:
%%time
df['ZIP5'] = df['F9_00_HD_FILER_ZIP_US']
df['ZIP5'] = df['ZIP5'].apply(lambda x: x[:5] if pd.notnull(x) else x)

In [72]:
df['ZIP5'].apply(lambda x: len(str(x))).value_counts()

ZIP5
5    2595322
4       3155
Name: count, dtype: int64

In [73]:
df['ZIP5_length'] = df['ZIP5'].apply(lambda x: len(str(x)))
df['ZIP5_length'].value_counts()

ZIP5_length
5    2595322
4       3155
Name: count, dtype: int64

In [83]:
df['ZIP5_lengthb'] = df['ZIP5'].apply(lambda x: np.nan if pd.isna(x) else len(str(x)))

In [76]:
df[df['ZIP5_length']==4]['F9_00_HD_FILER_ZIP_US'][:5]

18194    None
18195    None
28077    None
44208    None
45099    None
Name: F9_00_HD_FILER_ZIP_US, dtype: object

<br>Every row where `ZIP5_length` is 4 is a row where `ZIP5` is missing: `str(None)` is `'None'`, which has four characters.
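
The toy comparison below shows why the length of 4 appears and how `.str.len()` avoids it by returning a missing value for missing input. This is only an illustration, not a change to the cells above.

```python
import pandas as pd

toy = pd.DataFrame({"ZIP5": ["03911", "70130", None]})

# str(None) is the 4-character string 'None', so missing ZIPs look 4 long.
naive = toy["ZIP5"].apply(lambda x: len(str(x)))

# The .str accessor propagates missing values instead of stringifying them.
aware = toy["ZIP5"].str.len()

print(naive.tolist())
print(aware.tolist())
```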

In [89]:
df[df['ZIP5_length']==4][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']].sample(5)

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,BMF_F990_ORG_ADDR_ZIP,ZIP5_length,ZIP5_lengthb
2589033,MA,0,,,00000-0000,4,
1034647,IL,0,,,00000-0000,4,
2589878,,0,,,00000-0000,4,
2589627,,0,,,00000-0000,4,
2589240,OK,40115,,,00000-0000,4,


In [91]:
df[df['ZIP5']=='00000'][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']]

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,BMF_F990_ORG_ADDR_ZIP,ZIP5_length,ZIP5_lengthb


In [92]:
df[df['BMF_F990_ORG_ADDR_ZIP']=='00000'][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']]

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,BMF_F990_ORG_ADDR_ZIP,ZIP5_length,ZIP5_lengthb


In [94]:
print(len(df[df['BMF_F990_ORG_ADDR_ZIP']=='00000-0000'][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']]))
df[df['BMF_F990_ORG_ADDR_ZIP']=='00000-0000'][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']].sample(5)

2264


Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,BMF_F990_ORG_ADDR_ZIP,ZIP5_length,ZIP5_lengthb
598600,TX,0,FORT WORTH,76102.0,00000-0000,5,5.0
2261036,NY,21001,Long Island City,11101.0,00000-0000,5,5.0
560343,,0,,,00000-0000,4,
1584991,DC,0,,,00000-0000,4,
2590282,,0,,,00000-0000,4,


##### Replace '00000-0000' with missing

In [95]:
df['BMF_F990_ORG_ADDR_ZIP'] = df['BMF_F990_ORG_ADDR_ZIP'].replace('00000-0000', np.nan)

In [97]:
print(len(df[df['BMF_F990_ORG_ADDR_ZIP']=='00000-0000'][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']]))
df[df['BMF_F990_ORG_ADDR_ZIP']=='00000-0000'][geo_cols+zip_cols+['ZIP5_length', 'ZIP5_lengthb']]

0


Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,BMF_F990_ORG_ADDR_ZIP,ZIP5_length,ZIP5_lengthb


In [98]:
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].isnull())&(df['BMF_F990_ORG_ADDR_ZIP'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].notnull())&(df['BMF_F990_ORG_ADDR_ZIP'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].isnull())&(df['BMF_F990_ORG_ADDR_ZIP'].notnull())]))
print(len(df[(df['F9_00_HD_FILER_ZIP_US'].isnull())&(df['BMF_F990_ORG_ADDR_ZIP'].notnull())&(df['ForeignAddress'].notnull())]))

2129
4374
1026
1026


In [99]:
df['ZIP5_length'] = df['ZIP5'].apply(lambda x: len(str(x)))
df['ZIP5_length'].value_counts()

ZIP5_length
5    2595322
4       3155
Name: count, dtype: int64

In [101]:
df[df['ZIP5_length']==4][geo_cols+zip_cols+['ZIP5']].sample(5)

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,F9_00_HD_FILER_ZIP_US,BMF_F990_ORG_ADDR_ZIP,ZIP5
593039,PR,72087,,,00772-0509,
2589304,,0,,,,
1931123,PR,72061,,,00970-3930,
1570433,DC,0,,,,
2589282,,0,,,,


In [103]:
df = df.drop('ZIP5_length', axis=1)
df = df.drop('ZIP5_lengthb', axis=1)

##### Add `ZIP5` to `geo_cols`

In [105]:
geo_cols = geo_cols + ['ZIP5']

In [106]:
df[geo_cols].sample(5)

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,ZIP5
2342518,AZ,4025,PRESCOTT,86301
2216425,CA,6085,San Jose,95112
612092,MA,25021,WALPOLE,2081
2366238,UT,49035,SALT LAKE CITY,84101
2569990,CA,6037,PACIFIC PALISADES,90272


### Rationalize street address

In [111]:
[c for c in df.columns if 'add' in c.lower()]

['USAddress',
 'ForeignAddress',
 'F9_00_HD_ADDR_CHANGE',
 'F9_06_PC_OFFICER_MAILING_ADDRESS',
 'F9_00_HD_FILER_ADDR_US_L1',
 'F9_00_HD_FILER_ADDR_US_L2',
 'BMF_F990_ORG_ADDR_CITY',
 'BMF_F990_ORG_ADDR_STATE',
 'BMF_F990_ORG_ADDR_ZIP',
 'BMF_F990_ORG_ADDR_STREET',
 'BMF_ORG_ADDR_FULL',
 'BMF_ORG_ADDR_MATCH']

In [114]:
df[['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'BMF_F990_ORG_ADDR_STREET', 'BMF_ORG_ADDR_FULL',
    'BMF_ORG_ADDR_MATCH']].sample(5)

Unnamed: 0,F9_00_HD_FILER_ADDR_US_L1,F9_00_HD_FILER_ADDR_US_L2,BMF_F990_ORG_ADDR_STREET,BMF_ORG_ADDR_FULL,BMF_ORG_ADDR_MATCH
936963,3635 WEST VALENCIA DRIVE,,3635 W VALENCIA DR,"3635 W VALENCIA DR,FULLERTON,CA,92833-3134","3635 W Valencia Dr, Fullerton, California, 92833"
744002,PO BOX 14574,,PO BOX 14574,"PO BOX 14574,MADISON,WI,53708-0574","53708-0574, Madison, Wisconsin"
2434780,4700 228th St SW,,4700 228TH ST SW,"4700 228TH ST SW,MOUNTLAKE TER,WA,98043-4429","4700 228th St SW, Mountlake Terrace, Washington, 98043"
968170,413 OAKWOOD DR,,413 OAKWOOD DR,"413 OAKWOOD DR,CADIZ,OH,43907-1143","413 Oakwood Dr, Cadiz, Ohio, 43907"
1995996,3037 NW 63RD ST STE W104,,3037 NW 63RD ST STE W104,"3037 NW 63RD ST STE W104,OKLAHOMA CITY,OK,73116-3637","3037 NW 63rd St, Oklahoma City, Oklahoma, 73116"


In [116]:
len(df)

2598477

In [115]:
%%time
df[['F9_00_HD_FILER_ADDR_US_L1', 'F9_00_HD_FILER_ADDR_US_L2', 'BMF_F990_ORG_ADDR_STREET', 'BMF_ORG_ADDR_FULL',
    'BMF_ORG_ADDR_MATCH']].describe().T

CPU times: total: 2.91 s
Wall time: 3.07 s


Unnamed: 0,count,unique,top,freq
F9_00_HD_FILER_ADDR_US_L1,2595322,511035,2335 NORTH BANK DRIVE,2237
F9_00_HD_FILER_ADDR_US_L2,31890,3745,Suite,10741
BMF_F990_ORG_ADDR_STREET,2594149,260532,PO BOX 45998,2633
BMF_ORG_ADDR_FULL,2594238,316815,"PO BOX 45998,SAINT LOUIS,MO,63145-5998",2537
BMF_ORG_ADDR_MATCH,2592238,282135,"63145-5998, Saint Louis, Missouri",2537


In [120]:
print(len(df[(df['F9_00_HD_FILER_ADDR_US_L1'].isnull())&(df['BMF_F990_ORG_ADDR_STREET'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_ADDR_US_L1'].notnull())&(df['BMF_F990_ORG_ADDR_STREET'].isnull())]))
print(len(df[(df['F9_00_HD_FILER_ADDR_US_L1'].isnull())&(df['BMF_F990_ORG_ADDR_STREET'].notnull())]))
print(len(df[(df['F9_00_HD_FILER_ADDR_US_L1'].isnull())&(df['BMF_F990_ORG_ADDR_STREET'].notnull())&(df['ForeignAddress'].notnull())]))

12
4316
3143
3143


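##### Based on the counts above, just use `F9_00_HD_FILER_ADDR_US_L1`
- All 3,143 filings that lack a 990 street address but have a BMF street address also have a `ForeignAddress`, so the BMF street would only fill in foreign filers.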
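
The four counts printed above can also be read off a single cross-tabulation of the two missingness flags. This is only a minimal sketch on toy data: the `toy` frame and its values are invented for illustration, and only the two column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration only.
toy = pd.DataFrame({
    'F9_00_HD_FILER_ADDR_US_L1': ['1 MAIN ST', None, None, '2 OAK AVE', '3 ELM ST'],
    'BMF_F990_ORG_ADDR_STREET':  ['1 MAIN ST', 'PO BOX 9', None, None, '3 ELM ST'],
})

# Rows: is the 990 street missing? Columns: is the BMF street missing?
missing_xtab = pd.crosstab(
    toy['F9_00_HD_FILER_ADDR_US_L1'].isna().rename('f990_street_missing'),
    toy['BMF_F990_ORG_ADDR_STREET'].isna().rename('bmf_street_missing'),
)
print(missing_xtab)
```

Each cell of the resulting 2x2 table corresponds to one of the `isnull()`/`notnull()` combinations counted above.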

In [121]:
geo_cols

['BMF_F990_ORG_ADDR_STATE', 'COUNTY_CODE', 'F9_00_HD_FILER_CITY_US', 'ZIP5']

In [122]:
geo_cols = geo_cols+['F9_00_HD_FILER_ADDR_US_L1']
df[geo_cols].sample(5)

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,ZIP5,F9_00_HD_FILER_ADDR_US_L1
993289,IN,18057,INDIANAPOLIS,46240,510 E 96TH ST STE 180
5182,ME,23005,PORTLAND,4101,522 CONGRESS STREET
343989,PA,42017,NEWTOWN,18940,170 PHEASANT RUN SUITE 100
850625,NY,36085,STATEN ISLAND,10312,380 GENESEE AVE
247846,OH,39113,DAYTON,45401,PO BOX 2307


#### Verify `COUNTY_CODE`
- Check the chosen geo variables below to verify that I can ignore the remaining city, state, ZIP, and county FIPS code variables

In [123]:
df[geo_cols].info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2598477 entries, 0 to 2598476
Data columns (total 5 columns):
 #   Column                     Non-Null Count    Dtype 
---  ------                     --------------    ----- 
 0   BMF_F990_ORG_ADDR_STATE    2597278 non-null  object
 1   COUNTY_CODE                2598477 non-null  object
 2   F9_00_HD_FILER_CITY_US     2595322 non-null  object
 3   ZIP5                       2595322 non-null  object
 4   F9_00_HD_FILER_ADDR_US_L1  2595322 non-null  object
dtypes: object(5)
memory usage: 99.1+ MB


In [124]:
df[geo_cols].isna().sum()

BMF_F990_ORG_ADDR_STATE      1199
COUNTY_CODE                     0
F9_00_HD_FILER_CITY_US       3155
ZIP5                         3155
F9_00_HD_FILER_ADDR_US_L1    3155
dtype: int64

In [152]:
df[geo_cols].sample(5)

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,ZIP5,F9_00_HD_FILER_ADDR_US_L1
968357,OH,39095,SYLVANIA,43560,4747 N HOLLAND-SYLVANIA RD
1262928,MO,29169,RICHLAND,65556,306 S PINE PO BOX 69
489030,NJ,34025,MORGANVILLE,7751,70 AMBOY ROAD SUITE 101
2499275,CA,6087,SANTA CRUZ,95062,200 7TH AVENUE
1176764,WI,55025,MADISON,53711,2970 CHAPEL VALLEY RD NO 203


In [149]:
print(len(df.query('COUNTY_CODE=="00000" & ForeignAddress.notna()')))

1711


In [148]:
df.query('COUNTY_CODE=="00000" & ForeignAddress.notna()')[geo_cols+['ForeignAddress']]

Unnamed: 0,BMF_F990_ORG_ADDR_STATE,COUNTY_CODE,F9_00_HD_FILER_CITY_US,ZIP5,F9_00_HD_FILER_ADDR_US_L1,ForeignAddress
44208,WA,00000,,,,"{'AddressLine1': None, 'AddressLine1Txt': 'PO Box 19142 1156 56th St', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'Delta', 'Country': None, 'CountryCd': 'CA', 'ForeignPostalCd': 'V4L 2P8', 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': 'British Columbia'}"
45099,DC,00000,,,,"{'AddressLine1': '29 BOWEN STREET', 'AddressLine1Txt': None, 'AddressLine2': None, 'AddressLine2Txt': None, 'City': 'CAMBERWELL VICTORIA AU', 'CityNm': None, 'Country': 'AS', 'CountryCd': None, 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
45100,DC,00000,,,,"{'AddressLine1': '29 BOWEN STREET', 'AddressLine1Txt': None, 'AddressLine2': None, 'AddressLine2Txt': None, 'City': 'CAMBERWELL VICTORIA AU', 'CityNm': None, 'Country': 'AS', 'CountryCd': None, 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
45101,DC,00000,,,,"{'AddressLine1': '29 BOWEN STREET', 'AddressLine1Txt': None, 'AddressLine2': None, 'AddressLine2Txt': None, 'City': 'CAMBERWELL VICTORIA AU', 'CityNm': None, 'Country': 'AS', 'CountryCd': None, 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
86221,MA,00000,,,,"{'AddressLine1': 'PK 37 FATIH POSTAHANESI', 'AddressLine1Txt': None, 'AddressLine2': None, 'AddressLine2Txt': None, 'City': 'FATIH ISTANBUL TURKEY', 'CityNm': None, 'Country': 'TU', 'CountryCd': None, 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
...,...,...,...,...,...,...
2590680,,00000,,,,"{'AddressLine1': None, 'AddressLine1Txt': 'PO BOX 366', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'BIKENIBEU TARAWA', 'Country': None, 'CountryCd': 'KR', 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
2590681,,00000,,,,"{'AddressLine1': None, 'AddressLine1Txt': 'PO BOX 366', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'BETIO TARAWA', 'Country': None, 'CountryCd': 'KR', 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
2590682,,00000,,,,"{'AddressLine1': None, 'AddressLine1Txt': 'PO BOX 336', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'BIKENIBEU TARAWA', 'Country': None, 'CountryCd': 'KR', 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"
2590683,,00000,,,,"{'AddressLine1': None, 'AddressLine1Txt': 'PO BOX 336', 'AddressLine2': None, 'AddressLine2Txt': None, 'City': None, 'CityNm': 'BIKENIBEU TARAWA', 'Country': None, 'CountryCd': 'KR', 'ForeignPostalCd': None, 'PostalCode': None, 'ProvinceOrState': None, 'ProvinceOrStateNm': None}"


In [150]:
%%time
df['COUNTY_CODE'] = df['COUNTY_CODE'].replace('00000', np.nan)

CPU times: total: 203 ms
Wall time: 229 ms
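
One way to sanity-check this kind of sentinel replacement is to count the `'00000'` values first and compare that count with the number of new missing values afterwards. A minimal sketch, assuming `COUNTY_CODE` is stored as strings; the `toy` frame and the name `n_sentinel` are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the county codes are illustrative only.
toy = pd.DataFrame({'COUNTY_CODE': ['39095', '00000', '06087', '00000']})

# Count the sentinel and the existing NaNs before replacing.
n_sentinel = (toy['COUNTY_CODE'] == '00000').sum()
n_missing_before = toy['COUNTY_CODE'].isna().sum()

toy['COUNTY_CODE'] = toy['COUNTY_CODE'].replace('00000', np.nan)

# Every sentinel, and nothing else, should now be NaN.
n_missing_after = toy['COUNTY_CODE'].isna().sum()
print(n_sentinel, n_missing_before, n_missing_after)
```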


In [153]:
df[geo_cols].isna().sum()

BMF_F990_ORG_ADDR_STATE      1199
COUNTY_CODE                  6961
F9_00_HD_FILER_CITY_US       3155
ZIP5                         3155
F9_00_HD_FILER_ADDR_US_L1    3155
dtype: int64

#### Check state
`BMF_F990_ORG_ADDR_STATE` is the variable to use. See this notebook: *IRS 990 e-File Data -- IRS Files (8b) -- Fix State Variable and Inspect Industry Variables.ipynb*

In [155]:
geo_cols

['BMF_F990_ORG_ADDR_STATE',
 'COUNTY_CODE',
 'F9_00_HD_FILER_CITY_US',
 'ZIP5',
 'F9_00_HD_FILER_ADDR_US_L1']

In [156]:
pd.concat([df[geo_cols].isnull().sum(), 
           df[geo_cols].isnull().sum()/len(df)*100], axis=1)

Unnamed: 0,0,1
BMF_F990_ORG_ADDR_STATE,1199,0.05
COUNTY_CODE,6961,0.27
F9_00_HD_FILER_CITY_US,3155,0.12
ZIP5,3155,0.12
F9_00_HD_FILER_ADDR_US_L1,3155,0.12
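
The same summary can be built with labeled columns instead of the default `0` and `1`. A minimal sketch on toy data; the column labels `n_missing` and `pct_missing` and the `toy` frame are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for df[geo_cols]; the values are illustrative only.
toy = pd.DataFrame({
    'ZIP5':        ['46240', None, '10312', '45401'],
    'COUNTY_CODE': ['18057', None, None, '39113'],
})

# isna().mean() is the share of missing rows, so * 100 gives a percentage.
summary = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'pct_missing': toy.isna().mean() * 100,
})
print(summary)
```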


#### Save DF

In [159]:
print(len(set(df['EIN'].tolist())))
print(len(df))

351875
2598477


In [160]:
%%time
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_feather('D:/990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee_zip.feather')

Current date and time :  2025-06-20 16:19:26 

CPU times: total: 43.9 s
Wall time: 32.5 s


In [161]:
%%time
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_parquet("D:/990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee_zip.parquet", engine="pyarrow", compression="snappy", index=False)

Current date and time :  2025-06-20 16:22:07 

CPU times: total: 1min 15s
Wall time: 1min 17s


In [162]:
%%time
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee_zip.pkl.gz', compression='gzip')

Current date and time :  2025-06-20 16:23:38 

CPU times: total: 36min 58s
Wall time: 38min


In [163]:
%%time
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_csv('990_and_bmf_april_2025_all_controls_351875_orgs_2598477_filings_no_duplicates_fixed_state_ntee_zip.csv')

Current date and time :  2025-06-20 17:05:18 

CPU times: total: 9min 41s
Wall time: 10min 22s
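
A caution when reading the CSV version back: with default type inference, `pd.read_csv` parses all-digit columns such as `ZIP5` and `COUNTY_CODE` as integers, which drops leading zeros. A minimal sketch of the difference, using an in-memory buffer and toy values rather than the real file; it writes with `index=False`, which differs from the `to_csv` call above.

```python
import io
import pandas as pd

# Toy stand-in for the saved frame; the values are illustrative only.
toy = pd.DataFrame({'ZIP5': ['04101', '46240'],
                    'COUNTY_CODE': ['23005', '06087']})

buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default inference: both columns come back as integers.
buf.seek(0)
as_default = pd.read_csv(buf)

# Forcing string dtype keeps the five-character codes intact.
buf.seek(0)
as_text = pd.read_csv(buf, dtype={'ZIP5': str, 'COUNTY_CODE': str})

print(as_default.dtypes)
print(as_text)
```

The feather, parquet, and pickle versions store the column dtypes, so this only matters for the CSV.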
