# Notes
- In future runs, remove these variables from *binarize_cols*:
    - *F9_12_PC_ACCTG_METHOD_OTHER*
    - *F9_00_HD_EXEMPT_STATUS_501C*
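
A minimal sketch of that removal, assuming *binarize_cols* is a plain Python list of column names. The list contents below are hypothetical stand-ins; only the two names being dropped come from the note above.

```python
# Hypothetical starting list; the real binarize_cols is not shown in this notebook.
binarize_cols = [
    'F9_00_HD_ADDR_CHANGE',
    'F9_12_PC_ACCTG_METHOD_OTHER',
    'F9_00_HD_EXEMPT_STATUS_501C',
    'F9_00_HD_GROUP_RETURN',
]

# The two variables flagged for removal in the note above.
drop_from_binarize = {
    'F9_12_PC_ACCTG_METHOD_OTHER',
    'F9_00_HD_EXEMPT_STATUS_501C',
}

# Filter while preserving the order of the remaining names.
binarize_cols = [c for c in binarize_cols if c not in drop_from_binarize]
print(binarize_cols)
```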

# Overview
This notebook needs to be updated to reflect the new way of creating the *ReturnHeader* variables. Specifically, I have updated the concordance file to include additional *ReturnHeader* variables, and I have changed the MongoDB name from dotted names such as 'ReturnHeader.TaxYear' to simply 'ReturnHeader'. In notebook 'A1' I flatten the *ReturnHeader* column and then do the combining and renaming. So I need to make sure the new variables work, and also modify the code below for the following variables:

    'ReturnHeader.TaxYear': 1, 'ReturnHeader.TaxYr': 1,
    'ReturnHeader.TaxPeriodEndDate': 1, 'ReturnHeader.TaxPeriodEndDt': 1,  
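
A minimal sketch of the combining step for these four fields, on made-up rows. It assumes that, after flattening, each filing populates either the older name (`TaxYear`, `TaxPeriodEndDate`) or the newer one (`TaxYr`, `TaxPeriodEndDt`), not both. `F9_00_HD_TAX_PER_END` is the name the concordance assigns to the tax period end date; `TAX_YEAR_COMBINED` is a placeholder, not a name from the concordance.

```python
import numpy as np
import pandas as pd

# Toy rows: the first filing uses the older schema names, the second the newer ones.
toy = pd.DataFrame({
    'TaxYear': ['2010', np.nan],
    'TaxYr': [np.nan, '2018'],
    'TaxPeriodEndDate': ['2010-12-31', np.nan],
    'TaxPeriodEndDt': [np.nan, '2018-12-31'],
})

# (old name, new name) -> combined name. 'TAX_YEAR_COMBINED' is a placeholder.
pairs = {
    ('TaxYear', 'TaxYr'): 'TAX_YEAR_COMBINED',
    ('TaxPeriodEndDate', 'TaxPeriodEndDt'): 'F9_00_HD_TAX_PER_END',
}

for (old, new), target in pairs.items():
    # combine_first keeps the value from `old` and fills its gaps from `new`.
    toy[target] = toy[old].combine_first(toy[new])

combined = toy[list(pairs.values())]
print(combined)
```

If both columns of a pair were ever populated for the same filing, `combine_first` would silently keep the older-schema value, so a check for such overlaps may be worth adding.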

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [2]:
print(pd.__version__)

1.1.5


In [3]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [25]:
#cd '/Users/gsaxton/Dropbox/990 e-file data'

In [26]:
pwd

'C:\\Users\\Gregory\\IRS 990 Control Variables'

In [4]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


# Read in Concordance File
We are going to read in two codebooks. First, there is the 'concordance' file. Specifically, before re-arranging and renaming variables, we will read in the relevant section from the *master concordance* file, and then use this file to identify the relevant 'compensation' variables. In a following notebook, we will be using the *variable_name_new* field as our variable name.

In [39]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 15
# of observations: 388


Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,TaxPeriodEndDate,,
1,/Return/ReturnHeader/TaxPeriodEndDt,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,,TaxPeriodEndDt,,
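
The two rows above show how the concordance ties several schema-specific field names to one harmonized name. A minimal sketch of turning that into a lookup, using a toy frame built from values that appear in this concordance (the `AddressChange` row comes from further down the file); the variable names `toy_concordance` and `name_map` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the concordance: two header rows plus one checkbox row.
toy_concordance = pd.DataFrame({
    'MongoDB_Name': ['TaxPeriodEndDate', 'TaxPeriodEndDt', 'AddressChange'],
    'variable_name_new': ['F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_PER_END',
                          'F9_00_HD_ADDR_CHANGE'],
})

# Map each source field name to its harmonized variable name.
name_map = dict(zip(toy_concordance['MongoDB_Name'],
                    toy_concordance['variable_name_new']))
print(name_map['TaxPeriodEndDt'])
```

Because several source names share one target, renaming columns directly with this map would produce duplicate column names; the source columns need to be combined first.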


In [29]:
concordance['data_type_xsd'].count()  # number of non-null data types

372

In [30]:
concordance[concordance['data_type_xsd'].isnull()][['variable_name_new', 'description']]

Unnamed: 0,variable_name_new,description
4,F9_00_HD_BUILD_TIME_STAMP,Build time stamp - IRS internal field
11,TaxPeriod,
290,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,
291,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,
292,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,
293,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT,
294,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,
295,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,
296,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,
297,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,


In [31]:
concordance['data_type_xsd'].value_counts()

USAmountType            167
BooleanType              72
CheckboxType             48
USAmountNNType           39
IntegerNNType             8
CountType                 8
StateType                 6
DateType                  4
YearType                  4
LineExplanationType       4
TextType                  2
PersonNameType            2
CountryType               2
StringType                2
ShortExplanationType      2
TimestampType             2
Name: data_type_xsd, dtype: int64

In [32]:
concordance[concordance['data_type_xsd']=='BooleanType'][:1]

Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
26,/Return/ReturnData/IRS990/GroupReturnForAffiliates,,F9_00_HD_GROUP_RETURN,,,,,Indicates this form is a group return for subordinates,F990-PC-PART-00-SECTION-HA,PART-00,BooleanType,binarize,GroupReturnForAffiliates,,


In [33]:
concordance[concordance['data_type_xsd']=='CheckboxType'][:1]

Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
14,/Return/ReturnData/IRS990/AddressChange,,F9_00_HD_ADDR_CHANGE,20.0,,,,Indicates this form has an address change,F990-PC-PART-00-SECTION-B,PART-00,CheckboxType,binarize,AddressChange,,
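
Both *BooleanType* and *CheckboxType* rows are flagged `binarize`, yet the raw data encode them in several ways ('X', 'true'/'false', '0'/'1', and Python booleans all appear in the previews in this notebook). A minimal sketch of one way to harmonize them follows. The mapping and the helper name `to_binary` are assumptions for illustration, not the rule applied in the later notebooks.

```python
import numpy as np
import pandas as pd

# Encodings seen in the previews; treating them as 1/0 is an assumption.
TRUTHY = {'x', 'true', '1'}
FALSY = {'false', '0'}

def to_binary(value):
    """Map one raw checkbox/boolean value to 1.0, 0.0, or NaN."""
    if pd.isna(value):
        return np.nan  # left missing; whether blank means 0 is a separate decision
    text = str(value).strip().lower()
    if text in TRUTHY:
        return 1.0
    if text in FALSY:
        return 0.0
    return np.nan  # unrecognized encodings are left missing

raw = pd.Series(['X', 'true', 'false', '0', '1', True, False, np.nan])
binary = raw.map(to_binary)
print(binary.tolist())
```

For checkbox fields an empty value may well mean "not checked", but that choice is left open here rather than baked into the helper.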


# Read 990 DB into PANDAS 
- In the previous round there were 1,547,828 observations; in Feb. 2020 there were 1,727,056; in Nov. 2020 there are 1,895,016
- I also switched to using the *.pkl* file

In [13]:
#df = pd.read_csv('all filings - all control variables.csv', low_memory=False)
#print('# of columns:', len(df.columns))
#print('# of observations:', len(df))
#df[:2]

In [34]:
%%time
df = pd.read_pickle('all filings nov. 2020 - all control variables.pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

# of columns: 324
# of observations: 1895016


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOt
herSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,Total
GrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEn
frcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,fiscal_year
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-11-09T06:41:09-06:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': ...",X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0.0,0,0.0,1044925.0,1439340,0,0,30447,33563,0.0,1000,1075372,1473903,638637.0,925000,0.0,0,0,0,0.0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,0,0,0,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0.0,0,0,0,0,0,0,0,1439340.0,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}",256845,86228,"{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '51640', 'EOY': '240077'}",X,89152,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2010
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-11-09T07:32:06-08:00', 'TaxPeriodEndDate': '2011-06-30', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'MADDOX & ASSOCIATES APC'}, 'PreparerFirmUSAddress': {'AddressLine1': '...",,,266420,false,X,,X,1993,WY,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,19,13,0,,0,,,0,222839,265592,1425,828,,0,224264,266420,,0,,0,71405,82955,,0,0,189785,222550,261190,305505,-36926,-39085,1455332,1433342,17482,34577,1437850,1398765,false,false,false,19,13,true,true,false,false,false,true,true,true,true,false,false,true,true,true,true,false,false,true,true,false,,X,,,1180355,411648,0,true,true,false,0,,0,0,"{'TotalRevenueColumn': '266420', 'RelatedOrExemptFunctionIncome': '266420'}",{'Total': '0'},"{'Total': '7500', 'ManagementAndGeneral': '7500'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}",{'EOY': '0'},2187206,904332,"{'BOY': '1455332', 'EOY': '1433342'}","{'BOY': '9203', 'EOY': '11349'}",X,-39085,X,false,true,true,true,False,1736.0,False,False,265592.0,{'Total': '0'},{'Total': '0'},"{'Total': '59440', 'ProgramServices': '59440'}",{'Total': '0'},"{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '21600', 'ManagementAndGeneral': '21600'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'BOY': '250', 'EOY': '22261'}","{'BOY': '6219', 'EOY': '7035'}",True,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011


In [35]:
df[-2:]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOt
herSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,Total
GrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEn
frcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,fiscal_year
1895014,COMMUNICATION WORKERS OF AMERICA LOCAL 6201,https://s3.amazonaws.com/irs-form-990/202022309349303577_public.xml,93493230035770.0,201909.0,750817126.0,"{'@binaryAttachmentCnt': '0', 'ReturnTs': '2020-08-17T17:18:05-07:00', 'TaxPeriodEndDt': '2019-09-30', 'PreparerFirmGrp': {'PreparerFirmEIN': '274601790', 'PreparerFirmName': {'BusinessNameLine1Txt': 'Daryl C Soward CPA'}, 'PreparerUSAddress': {'...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,{'TotalAmt': '0'},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,496889,false,,,,TX,Local Labor Union Representing Members of the Communication Workers of America.,4,0,41,0,474396,0,20351,2142,496889,0,0,296767,0,0,246540,543307,-46418,3167749,3121837,817,3167438,3121020,false,false,false,4,0,false,false,false,false,true,true,true,true,true,false,false,false,false,false,false,false,false,false,,,false,false,false,474396.0,,,474396.0,2142.0,"{'TotalRevenueColumnAmt': '496889', 'RelatedOrExemptFuncIncomeAmt': '22493'}","{'TotalAmt': '5913', 'ManagementAndGeneralAmt': '5913'}","{'TotalAmt': '543307', 'ProgramServicesAmt': '123662', 'ManagementAndGeneralAmt': '419645', 'FundraisingAmt': '0'}","{'BOYAmt': '140194', 'EOYAmt': '98702'}","{'BOYAmt': '3167749', 'EOYAmt': '3121837'}",,,-46418,X,false,true,false,www.cwa6201.org,,,503335,,6322,10145,519802,,,308373,,243420,551793,-31991,311,false,false,,192892,,,0,0,,0,,,,,,"{'TotalAmt': '192892', 'ProgramServicesAmt': '86983', 'ManagementAndGeneralAmt': '105909'}","{'TotalAmt': '73967', 'ProgramServicesAmt': '4086', 'ManagementAndGeneralAmt': '69881'}","{'TotalAmt': '4654', 'ManagementAndGeneralAmt': '4654'}","{'TotalAmt': '1608', 'ManagementAndGeneralAmt': '1608'}","{'TotalAmt': '23646', 'ProgramServicesAmt': '8078', 'ManagementAndGeneralAmt': '15568'}",{'TotalAmt': '0'},"{'BOYAmt': '2006329', 'EOYAmt': '2041045'}",1607660,625570,,"{'BOYAmt': '311', 'EOYAmt': '817'}",X,,False,False,{'TotalAmt': '0'},{'TotalAmt': 
'0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},,,,,,,,,,,,,,,,,,,,,,,X,,,,,,"{'@organization501cTypeTxt': '5', '#text': 'X'}",,,,2019.0
1895015,,,,,,"{'@binaryAttachmentCnt': '0', 'ReturnTs': '2019-11-15T10:29:26-06:00', 'TaxPeriodEndDt': '2018-12-31', 'PreparerFirmGrp': {'PreparerFirmEIN': '770051130', 'PreparerFirmName': {'BusinessNameLine1Txt': 'ABBOTT STRINGHAM & LYNCH'}, 'PreparerUSAddres...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,JOHN MORA,1075,0,X,X,2006.0,CA,OPERATING YOUTH SPORTS PROGRAMS FOR THE PUBLIC BENEFIT,3,0,0,0,0,1075,0,0,1075,0,0,0,0,0,10378,10378,-9303,32567,23264,0,32567,23264,0,0,0,3,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,0,0,CA,X,0,0,0,,,,,,"{'TotalRevenueColumnAmt': '1075', 'RelatedOrExemptFuncIncomeAmt': '1075', 'UnrelatedBusinessRevenueAmt': '0', 'ExclusionAmt': '0'}",,"{'TotalAmt': '10378', 'ProgramServicesAmt': '8649', 'ManagementAndGeneralAmt': '1729', 'FundraisingAmt': '0'}","{'BOYAmt': '6630', 'EOYAmt': '4245'}","{'BOYAmt': '32567', 'EOYAmt': '23264'}",,,-9303,X,0,0,0,WWW.PLAYFLAGFOOTBALL.COM,300.0,0.0,0,250.0,0,0,250,0.0,0.0,0,0.0,8092,8092,-7842,0,1,1,X,0,0.0,0.0,0,0,,1075,,,,,,,,,,,,,215130,196111,,,X,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### Create and save list of EINs for BMF File
In the previous round I believe there were only 296,334 EINs (though that may have been only for valid BMF EINs).

In [36]:
ein_list = df['EIN'].tolist()
print(len(ein_list))
print(len(set(ein_list)))
ein_list = list(set(ein_list))
print(len(ein_list))

1895016
336695
336695


In [37]:
import json
with open('ein_list.json', 'w') as fp:
    json.dump(ein_list, fp)
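
The last row of the preview above has no EIN, and the EIN values in the final rows display as floats, so the saved list may contain missing values. By default Python's `json` module writes these as a bare `NaN` token, which strict JSON parsers reject. A minimal sketch of a cleaned export on toy values; the variable names are illustrative.

```python
import json
import math

# Toy values mimicking the EIN column: floats, a duplicate, and a missing value.
raw_eins = [232705170.0, 581805618.0, 232705170.0, float('nan')]

# Drop missing values, cast to int, de-duplicate, and sort for a stable file.
clean_eins = sorted({int(e) for e in raw_eins if not math.isnan(e)})

# allow_nan=False makes json.dumps raise instead of writing a bare NaN token.
payload = json.dumps(clean_eins, allow_nan=False)
print(payload)
```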

<br> Add BMF variables in the following notebook:
*IRS 990 e-File Data -- CONTROL VARIABLES (A4) -- Merge in BMF Data (NTEE, MSA, etc) - uses 'combine_first' Combine Columns (Python 3.6).ipynb*

### Collapse

In [40]:
def agg_funcs(x):
    """Collapse one variable_name_new group into a single row."""
    names = {
        #'name': x['variable_name_new'].head(1).values[0],
        # unique source field names (set order is arbitrary)
        'original_names':  list(set(x['MongoDB_Name'].tolist())),
        # data type of the group's first row (may be NaN)
        'data_type_xsd': x['data_type_xsd'].head(1).values[0]
        }
    #THE FOLLOWING SHORTCUT WORKS BUT CHANGES THE ORDER OF THE COLUMNS
    #return pd.Series(names, index = list(names.keys()))
    return pd.Series(names, index=['original_names', 'data_type_xsd'])
new_variables_df = concordance.groupby(['variable_name_new']).apply(agg_funcs)
new_variables_df = new_variables_df.reset_index()
print('# of variables:', len(new_variables_df))
new_variables_df[:5]

# of variables: 193


Unnamed: 0,variable_name_new,original_names,data_type_xsd
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",CheckboxType
1,F9_00_HD_AMENDED_RETURN,"[AmendedReturn, AmendedReturnInd]",CheckboxType
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],
3,F9_00_HD_CTRY_OF_DOMICILE,"[CountryLegalDomicile, LegalDomicileCountryCd]",CountryType
4,F9_00_HD_EXEMPT_STATUS_4847A1,"[Organization4947a1NotPFInd, Organization4947a1]",CheckboxType
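
The same collapse can also be written with named aggregation. The sketch below runs on a toy frame built from names in this notebook and is an alternative formulation, not the code that produced the output above. It differs in two ways: `sorted()` makes the order of *original_names* deterministic, and `'first'` returns the first non-null data type in a group rather than the first row's value.

```python
import numpy as np
import pandas as pd

# Toy concordance rows (names taken from the previews in this notebook).
toy = pd.DataFrame({
    'variable_name_new': ['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_ADDR_CHANGE',
                          'F9_00_HD_BUILD_TIME_STAMP'],
    'MongoDB_Name': ['AddressChange', 'AddressChangeInd', 'BuildTS'],
    'data_type_xsd': ['CheckboxType', 'CheckboxType', np.nan],
})

collapsed = (
    toy.groupby('variable_name_new')
       .agg(
           # sorted() gives a deterministic order, unlike list(set(...))
           original_names=('MongoDB_Name', lambda s: sorted(set(s))),
           # 'first' returns the first non-null value in the group
           data_type_xsd=('data_type_xsd', 'first'),
       )
       .reset_index()
)
print(collapsed)
```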


In [41]:
new_variables_df['len'] = new_variables_df['original_names'].apply(len)
print(new_variables_df['len'].value_counts(), '\n')
new_variables_df[:4]

2    189
1      4
Name: len, dtype: int64 



Unnamed: 0,variable_name_new,original_names,data_type_xsd,len
0,F9_00_HD_ADDR_CHANGE,"[AddressChange, AddressChangeInd]",CheckboxType,2
1,F9_00_HD_AMENDED_RETURN,"[AmendedReturn, AmendedReturnInd]",CheckboxType,2
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],,1
3,F9_00_HD_CTRY_OF_DOMICILE,"[CountryLegalDomicile, LegalDomicileCountryCd]",CountryType,2


## Flatten *ReturnHeader* column

### 2/14/2020 --  THIS IS A NEW APPROACH --> FLATTEN THEN REMOVE NON-USED COLUMNS

###  BE SURE TO FOLLOW THROUGH WITH CHANGES TO THE FOLLOWING VARIABLES IN SUBSEQUENT NOTEBOOKS:
    'ReturnHeader.TaxYear': 1, 'ReturnHeader.TaxYr': 1,
    'ReturnHeader.TaxPeriodEndDate': 1, 'ReturnHeader.TaxPeriodEndDt': 1,  

In [49]:
print(df.columns.tolist()[-5:])

['Organization501cInd', 'Organization4947a1NotPFInd', 'AmendedReturnInd', 'SpecialConditionDesc', 'fiscal_year']
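
The next cell flattens *ReturnHeader* with `apply(pd.Series)`, which builds one Series per row. A sketch of an alternative follows: passing the list of dicts to the `DataFrame` constructor in a single call. This is generally much faster, though it has not been timed on this dataset. The toy rows and the third EIN are made up, and the guard for non-dict entries is an assumption about what the column might contain.

```python
import numpy as np
import pandas as pd

# Toy frame; ReturnHeader holds one dict per filing, with schema-dependent keys.
toy = pd.DataFrame({
    'EIN': [232705170, 750817126, 111111111],
    'ReturnHeader': [
        {'Timestamp': '2011-11-09T06:41:09-06:00', 'TaxPeriodEndDate': '2010-12-31'},
        {'ReturnTs': '2020-08-17T17:18:05-07:00', 'TaxPeriodEndDt': '2019-09-30'},
        np.nan,  # a filing with no header, to show the guard below
    ],
})

# Replace anything that is not a dict with an empty dict so the constructor
# sees a uniform list of records.
records = [h if isinstance(h, dict) else {} for h in toy['ReturnHeader']]

# Build all header columns in one call; only top-level keys are expanded.
header = pd.DataFrame(records, index=toy.index)

flat = pd.concat([toy.drop(columns=['ReturnHeader']), header], axis=1)
print(flat.columns.tolist())
```

As with `apply(pd.Series)`, nested values such as the preparer firm dicts stay as dicts in their own columns.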


In [50]:
%%time
print("Number of columns:", len(df.columns))
df = pd.concat([df.drop(['ReturnHeader'], axis=1), df['ReturnHeader'].apply(pd.Series)], axis=1)
print("Number of columns:", len(df.columns))
df[:2]

Number of columns: 324
Number of columns: 346
Wall time: 15min 46s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,Nu
mberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,F9_09_PC_FEES_FOR_SVCE_FR_TOT,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CY
ContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponReq
uestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,fiscal_year,@binaryAttachmentCount,Timestamp,TaxPeriodEndDate,PreparerFirm,ReturnType,TaxPeriodBeginDate,Filer,Officer,Preparer,TaxYear,BuildTS,DisasterRelief,@binaryAttachmentCnt,ReturnTs,TaxPeriodEndDt,PreparerFirmGrp,ReturnTypeCd,TaxPeriodBeginDt,BusinessOfficerGrp,PreparerPersonGrp,TaxYr,DisasterReliefTxt,FilingSecurityInformation
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0.0,0,0.0,1044925.0,1439340,0,0,30447,33563,0.0,1000,1075372,1473903,638637.0,925000,0.0,0,0,0,0.0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,0,0,0,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0.0,0,0,0,0,0,0,0,1439340.0,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}",256845,86228,"{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '51640', 'EOY': '240077'}",X,89152,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2010,0,2011-11-09T06:41:09-06:00,2010-12-31,"{'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY SUITE 30', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '180172285'}}",990,2010-01-01,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...","{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}","{'Name': 'E BARRY 
HETZEL CPA', 'Phone': '6104335501'}",2010,2016-02-24 21:20:13Z,,,,,,,,,,,,
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,,,266420,false,X,,X,1993,WY,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,19,13,0,,0,,,0,222839,265592,1425,828,,0,224264,266420,,0,,0,71405,82955,,0,0,189785,222550,261190,305505,-36926,-39085,1455332,1433342,17482,34577,1437850,1398765,false,false,false,19,13,true,true,false,false,false,true,true,true,true,false,false,true,true,true,true,false,false,true,true,false,,X,,,1180355,411648,0,true,true,false,0,,0,0,"{'TotalRevenueColumn': '266420', 'RelatedOrExemptFunctionIncome': '266420'}",{'Total': '0'},"{'Total': '7500', 'ManagementAndGeneral': '7500'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}",{'EOY': '0'},2187206,904332,"{'BOY': '1455332', 'EOY': '1433342'}","{'BOY': '9203', 'EOY': '11349'}",X,-39085,X,false,true,true,true,False,1736.0,False,False,265592.0,{'Total': '0'},{'Total': '0'},"{'Total': '59440', 'ProgramServices': '59440'}",{'Total': '0'},"{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '21600', 'ManagementAndGeneral': '21600'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'BOY': '250', 'EOY': '22261'}","{'BOY': '6219', 'EOY': '7035'}",True,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011,0,2011-11-09T07:32:06-08:00,2011-06-30,"{'PreparerFirmBusinessName': {'BusinessNameLine1': 'MADDOX & ASSOCIATES APC'}, 'PreparerFirmUSAddress': {'AddressLine1': '5627 BANKERS AVE BLDG 2', 'City': 'BATON ROUGE', 'State': 'LA', 'ZIPCode': '708082610'}}",990,2010-07-01,"{'EIN': '581805618', 'Name': {'BusinessNameLine1': 'TORRINGTON VOA ELDERLY 
HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}, 'NameControl': 'TORR', 'Phone': '7033415000', 'USAddress': {'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA'...","{'Name': 'THOMAS D TURNBULL', 'Title': 'ASST. SEC/TREAS', 'DateSigned': '2011-11-09'}","{'Name': 'WILLIAM B BEALE', 'Phone': '2259263360'}",2010,2016-02-24 21:20:13Z,,,,,,,,,,,,


In [63]:
print(df.columns.tolist()[-23:])

['@binaryAttachmentCount', 'Timestamp', 'TaxPeriodEndDate', 'PreparerFirm', 'ReturnType', 'TaxPeriodBeginDate', 'Filer', 'Officer', 'Preparer', 'TaxYear', 'BuildTS', 'DisasterRelief', '@binaryAttachmentCnt', 'ReturnTs', 'TaxPeriodEndDt', 'PreparerFirmGrp', 'ReturnTypeCd', 'TaxPeriodBeginDt', 'BusinessOfficerGrp', 'PreparerPersonGrp', 'TaxYr', 'DisasterReliefTxt', 'FilingSecurityInformation']


In [69]:
set(df.columns.tolist()[-23:]) - set(['@binaryAttachmentCount', 'PreparerFirm', 'ReturnType', 'TaxPeriodBeginDate', 'Preparer', 
             'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd', 'TaxPeriodBeginDt', 
             'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation'])

{'BuildTS',
 'BusinessOfficerGrp',
 'Filer',
 'Officer',
 'ReturnTs',
 'TaxPeriodEndDate',
 'TaxPeriodEndDt',
 'TaxYear',
 'TaxYr',
 'Timestamp'}

In [71]:
%%time 
#print([c for c in df.columns.tolist() if c not in mongo_cols])
omit_cols = ['@binaryAttachmentCount', 'PreparerFirm', 'ReturnType', 'TaxPeriodBeginDate', 'Preparer', 
             'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd', 'TaxPeriodBeginDt', 
             'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation']
print('omit_cols:', len(omit_cols))
print("Number of columns:", len(df.columns))
df = df[[c for c in df.columns.tolist() if c not in omit_cols]]
print(len(df))
print("Number of columns:", len(df.columns))
df[:1]

omit_cols: 13
346
1895016
333
Wall time: 57.7 s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,Nu
mberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,F9_09_PC_FEES_FOR_SVCE_FR_TOT,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CY
ContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponReq
uestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,fiscal_year,Timestamp,TaxPeriodEndDate,Filer,Officer,TaxYear,BuildTS,ReturnTs,TaxPeriodEndDt,BusinessOfficerGrp,TaxYr
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0,0,0,1044925,1439340,0,0,30447,33563,0,1000,1075372,1473903,638637,925000,0,0,0,0,0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,0,0,0,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0,0,0,0,0,0,0,0,1439340,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}",256845,86228,"{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '51640', 'EOY': '240077'}",X,89152,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2010,2011-11-09T06:41:09-06:00,2010-12-31,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...","{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",2010,2016-02-24 21:20:13Z,,,,
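
The list-comprehension filter above can also be written with `DataFrame.drop`. The following is a minimal sketch on a toy frame, not the notebook's actual data: the cell values are made up for illustration, and `errors='ignore'` is an assumption that some of the omitted columns may be missing from a given extract.

```python
import pandas as pd

# Toy frame standing in for the full filings DataFrame (values are illustrative only).
toy = pd.DataFrame({
    'EIN': ['232705170'],
    'TaxYear': ['2010'],
    'PreparerFirm': [{'PreparerFirmBusinessName': 'X'}],
    'ReturnType': ['990'],
})

# A shortened version of omit_cols; 'FilingSecurityInformation' is absent from the toy frame.
omit_cols = ['PreparerFirm', 'ReturnType', 'FilingSecurityInformation']

# errors='ignore' skips names in omit_cols that are not present in the frame.
trimmed = toy.drop(columns=omit_cols, errors='ignore')
print(trimmed.columns.tolist())
```

Compared with the list comprehension, `drop` states the intent directly and does not need `df.columns.tolist()`.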


#### Save DF

In [72]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (renamed).pkl.gz', compression='gzip')

Wall time: 28min 49s
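
Before committing to a save that takes this long, it can help to confirm on a small frame that dictionary-valued columns survive the gzip pickle round trip. The sketch below is only an illustration: the toy frame and the temporary file name are assumptions, not part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one dictionary-valued column, mimicking columns such as 'Filer'.
toy = pd.DataFrame({'EIN': ['232705170'], 'Filer': [{'EIN': '232705170'}]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl.gz')  # hypothetical file name
    toy.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

# Unlike a CSV round trip, the pickle keeps the dict as a dict.
print(type(restored.loc[0, 'Filer']))
```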


### Process the two *ReturnHeader* variables separately
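- This is the *old* way of doing it. The cells in this section are commented out and kept for reference only, because *ReturnHeader* is now flattened in notebook 'A1'.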

In [19]:
#new_variables_df[new_variables_df['variable_name_new'].isin(['F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR'])]

Unnamed: 0,variable_name_new,original_names,data_type_xsd,len
15,F9_00_HD_TAX_PER_END,"[ReturnHeader.TaxPeriodEndDt, ReturnHeader.TaxPeriodEndDate]",DateType,2
16,F9_00_HD_TAX_YEAR,"[ReturnHeader.TaxYear, ReturnHeader.TaxYr]",YearType,2


In [20]:
#df.sample(1)

Unnamed: 0,AccountantCompileOrReview,ActivityOrMissionDescription,AddressChange,AllAffiliatesIncluded,AllOtherContributions,AmendedReturn,AnnualDisclosureCoveredPersons,AuditCommittee,BenefitsPaidToMembersCY,BenefitsPaidToMembersPriorYear,CashNonInterestBearing,ChangesToOrganizingDocs,CntrbtnsRprtdFundraisingEvents,CompCurrentOfficersDirectors,CompDisqualPersons,CompensationFromOtherSources,CompensationProcessCEO,CompensationProcessOther,ConflictOfInterestPolicy,ContributionsGrantsCurrentYear,ContributionsGrantsPriorYear,CostOfGoodsSold,CountryLegalDomicile,DLN,DecisionsSubjectToApproval,DelegationOfManagementDuties,DoNotFollowSFAS117,DocumentRetentionPolicy,EIN,ElectionOfBoardMembers,FSAudited,FamilyOrBusinessRelationship,FederalGrantAuditPerformed,FederalGrantAuditRequired,FederatedCampaigns,FeesForServicesAccounting,FeesForServicesInvstMgmntFees,FeesForServicesLegal,FeesForServicesLobbying,FeesForServicesManagement,FeesForServicesOther,F9_09_PC_FEES_FOR_SVCE_FR_TOT,FollowSFAS117,Form990ProvidedToGoverningBody,FormersListed,FundraisingDirectExpenses,FundraisingEvents,GamingDirectExpenses,GovernmentGrants,GrantsAndSimilarAmntsCY,GrantsAndSimilarAmntsPriorYear,GrossIncomeFundraisingEvents,GrossIncomeGaming,GrossReceipts,GrossSalesOfInventory,GroupExemptionNumber,GroupReturnForAffiliates,InitialReturn,InvestmentInJointVenture,InvestmentIncomeCurrentYear,InvestmentIncomePriorYear,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasis,LoansFromOfficersDirectors,LocalChapters,MaterialDiversionOrMisuse,MembersOrStockholders,MembershipDues,MethodOfAccountingAccrual,MethodOfAccountingCash,MethodOfAccountingOther,MinutesOfCommittees,MinutesOfGoverningBody,MortNotesPyblSecuredInvestProp,NameOfPrincipalOfficerPerson,NbrIndependentVotingMembers,NbrVotingGoverningBodyMembers,NbrVotingMembersGoverningBody,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,NetUnrelatedBusinessTxblIncome,NoListedPersonsCompensated,NoncashContributions,NumberIndependentVotingMembers,NumberIndi
vidualsGT100K,NumberOfContractorsGT100K,OfficerMailingAddress,Organization4947a1,Organization501c,Organization501c3,OrganizationName,OtherEmployeeBenefits,OtherExpensePriorYear,OtherExpensesCurrentYear,OtherLiabilities,OtherRevenueCurrentYear,OtherRevenuePriorYear,OtherSalariesAndWages,OtherWebsite,OwnWebsite,PayrollTaxes,PensionPlanContributions,PoliciesReferenceChapters,ProgramServiceRevenueCY,ProgramServiceRevenuePriorYear,ReconcilationRevenueExpenses,RegularMonitoringEnforcement,RelatedOrganizations,RetainedEarningsEndowmentEtc,ReturnHeader,RevenuesLessExpensesCY,RevenuesLessExpensesPriorYear,SalariesEtcCurrentYear,SalariesEtcPriorYear,SavingsAndTempCashInvestments,SpecialConditionDescription,StateLegalDomicile,StatesWhereCopyOfReturnIsFiled,TaxExemptBondLiabilities,TaxPeriod,TerminatedReturn,TerminationOrContraction,TotalAssets,TotalAssetsBOY,TotalAssetsEOY,TotalCompGT150K,TotalContributions,TotalExpensesCurrentYear,TotalExpensesPriorYear,TotalFunctionalExpenses,TotalFundrsngExpCurrentYear,TotalGrossUBI,TotalLiabilitiesBOY,TotalLiabilitiesEOY,TotalNbrEmployees,TotalNbrVolunteers,TotalOtherCompensation,TotalOtherRevenue,TotalProfFundrsngExpCY,TotalProfFundrsngExpPriorYear,TotalProgramServiceRevenue,TotalReportableCompFrmRltdOrgs,TotalReportableCompFromOrg,TotalRevenue,TotalRevenueCurrentYear,TotalRevenuePriorYear,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,TypeOfOrganizationCorporation,TypeOfOrganizationOther,TypeOfOrganizationTrust,URL,UnsecuredNotesLoansPayable,UponRequest,WebSite,WhistleblowerPolicy,WrittenPolicyOrProcedure,YearFormation,ReconcilationDonatedServices,ReconcilationInvestExpenses,ReconcilationPriorAdjustment,ReconciliationUnrealizedInvest,AccountantCompileOrReviewInd,ActivityOrMissionDesc,AddressChangeInd,AllAffiliatesIncludedInd,AllOtherContributionsAmt,AmendedReturnInd,AnnualDisclosureCoveredPrsnInd,AuditCommitteeInd,CYBenefitsPaidToMembersAmt,CYContributionsGrantsAmt,CYGrantsAndSimilarPaidAmt,CYInvestmentIncomeAmt,CYOtherExpensesA
mt,CYOtherRevenueAmt,CYProgramServiceRevenueAmt,CYRevenuesLessExpensesAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalExpensesAmt,CYTotalFundraisingExpenseAmt,CYTotalProfFndrsngExpnsAmt,CYTotalRevenueAmt,CashNonInterestBearingGrp,ChangeToOrgDocumentsInd,CntrctRcvdGreaterThan100KCnt,CompCurrentOfcrDirectorsGrp,CompDisqualPersonsGrp,CompensationFromOtherSrcsInd,CompensationProcessCEOInd,CompensationProcessOtherInd,ConflictOfInterestPolicyInd,ContractTerminationInd,ContriRptFundraisingEventAmt,CostOfGoodsSoldAmt,DecisionsSubjectToApprovaInd,DelegationOfMgmtDutiesInd,DocumentRetentionPolicyInd,DonatedServicesAndUseFcltsAmt,ElectionOfBoardMembersInd,FSAuditedInd,FamilyOrBusinessRlnInd,FederalGrantAuditPerformedInd,FederalGrantAuditRequiredInd,FederatedCampaignsAmt,FeesForServicesAccountingGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForServicesManagementGrp,FeesForServicesOtherGrp,FeesForSrvcInvstMgmntFeesGrp,FinalReturnInd,Form990ProvidedToGvrnBodyInd,FormationYr,FormerOfcrEmployeesListedInd,FundraisingAmt,FundraisingDirectExpensesAmt,FundraisingGrossIncomeAmt,GamingDirectExpensesAmt,GamingGrossIncomeAmt,GoverningBodyVotingMembersCnt,GovernmentGrantsAmt,GrossReceiptsAmt,GrossSalesOfInventoryAmt,GroupExemptionNum,GroupReturnForAffiliatesInd,IndependentVotingMemberCnt,IndivRcvdGreaterThan100KCnt,InitialReturnInd,InvestmentExpenseAmt,InvestmentInJointVentureInd,LandBldgEquipAccumDeprecAmt,LandBldgEquipCostOrOtherBssAmt,LegalDomicileCountryCd,LegalDomicileStateCd,LoansFromOfficersDirectorsGrp,LocalChaptersInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,MembershipDuesAmt,MethodOfAccountingAccrualInd,MethodOfAccountingCashInd,MethodOfAccountingOtherInd,MinutesOfCommitteesInd,MinutesOfGoverningBodyInd,MortgNotesPyblScrdInvstPropGrp,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,NetUnrelatedBusTxblIncmAmt,NetUnrlzdGainsLossesInvstAmt,NoListedPersonsCompensatedInd,NoncashContributionsAmt,OfficerMailingAddressInd,OrgDoesNotFollowSFAS117Ind,Organizatio
n4947a1NotPFInd,Organization501c3Ind,Organization501cInd,OrganizationFollowsSFAS117Ind,OtherEmployeeBenefitsGrp,OtherLiabilitiesGrp,OtherOrganizationDsc,OtherRevenueTotalAmt,OtherSalariesAndWagesGrp,OtherWebsiteInd,OwnWebsiteInd,PYBenefitsPaidToMembersAmt,PYContributionsGrantsAmt,PYGrantsAndSimilarPaidAmt,PYInvestmentIncomeAmt,PYOtherExpensesAmt,PYOtherRevenueAmt,PYProgramServiceRevenueAmt,PYRevenuesLessExpensesAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalExpensesAmt,PYTotalProfFndrsngExpnsAmt,PYTotalRevenueAmt,PayrollTaxesGrp,PensionPlanContributionsGrp,PoliciesReferenceChaptersInd,PrincipalOfficerNm,PriorPeriodAdjustmentsAmt,ReconcilationRevenueExpnssAmt,RegularMonitoringEnfrcInd,RelatedOrganizationsAmt,RtnEarnEndowmentIncmOthFndsGrp,SavingsAndTempCashInvstGrp,SpecialConditionDesc,StatesWhereCopyOfReturnIsFldCd,TaxExemptBondLiabilitiesGrp,TotReportableCompRltdOrgAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalAssetsGrp,TotalCompGreaterThan150KInd,TotalContributionsAmt,TotalEmployeeCnt,TotalFunctionalExpensesGrp,TotalGrossUBIAmt,TotalLiabilitiesBOYAmt,TotalLiabilitiesEOYAmt,TotalOtherCompensationAmt,TotalProgramServiceRevenueAmt,TotalReportableCompFromOrgAmt,TotalRevenueGrp,TotalVolunteersCnt,TypeOfOrganizationAssocInd,TypeOfOrganizationCorpInd,TypeOfOrganizationOtherInd,TypeOfOrganizationTrustInd,UnsecuredNotesLoansPayableGrp,UponRequestInd,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,WebsiteAddressTxt,WhistleblowerPolicyInd,WrittenPolicyOrProcedureInd,fiscal_year
3393,,,,,,,,,,,,,,,,,,,,,,,,93493267001399,,,,,930846286,,,,,,,,,,,,,{'TotalAmt': '0'},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,ST PAUL PAROCHIAL SCHOOL ENDOWMENT FUND,,,,,,,,,,,,,,,,,,,"{'TaxPeriodEndDt': '2019-06-30', 'TaxYr': '2018'}",,,,,,,,,,201906,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,https://s3.amazonaws.com/irs-form-990/201942679349300139_public.xml,,,,,,,,,,,False,SUPPORTING ORGANIZATION FOR ST. PAUL PAROCHIAL SCHOOL.,,False,14140,,False,,0,14140,24403,33397,9690,0,2450,15894,0,34093,0,0,49987,"{'BOYAmt': '107', 'EOYAmt': '284'}",False,0,{'TotalAmt': '0'},{'TotalAmt': '0'},False,False,False,False,,,,False,False,False,,False,False,False,,False,,"{'TotalAmt': '1345', 'ProgramServicesAmt': '1345'}",{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},{'TotalAmt': '0'},"{'TotalAmt': '5838', 'ProgramServicesAmt': '5838'}",,False,,False,,,,,,13,,232935,,,False,0,0,,,False,,,,,,False,False,False,,,X,,False,False,,578679,594573,,,X,,False,,,X,,X,{'TotalAmt': '0'},,,0,{'TotalAmt': '0'},,,,24634,27521,28259,9430,,2775,18717,,36951,,55668,{'TotalAmt': '0'},{'TotalAmt': '0'},,,,15894,False,,,"{'BOYAmt': '83504', 'EOYAmt': '43131'}",,,,,578679,594573,"{'BOYAmt': '578679', 'EOYAmt': '594573'}",False,14140,0,"{'TotalAmt': '34093', 'ProgramServicesAmt': '34093', 'ManagementAndGeneralAmt': '0', 'FundraisingAmt': '0'}",0,,0,,2450,,"{'TotalRevenueColumnAmt': '49987', 'RelatedOrExemptFuncIncomeAmt': '7849', 'ExclusionAmt': '27998'}",,,,,,,,13,0,,False,,2019


<br>*ast* is only needed when reading in a CSV file, where the dictionaries are stored as strings.

In [21]:
#from ast import literal_eval
#import ast
#for index, row in df[:2].iterrows():
#    print(row['ReturnHeader'])
#    print(type(row['ReturnHeader']))
#    #USE FOLLOWING CODE IF I AM IMPORTING CSV
#    #print(type(ast.literal_eval(row['ReturnHeader']))) 
#    #print(ast.literal_eval(row['ReturnHeader'])['Total'], '\n')

{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}
<class 'dict'>
{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}
<class 'dict'>
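
When the data does come from a CSV file, each *ReturnHeader* cell arrives as a string and has to be parsed back into a dictionary. A minimal sketch, assuming a small Series built from the first value printed above; the helper name `parse_cell` is hypothetical.

```python
import ast

import numpy as np
import pandas as pd

# One stringified dict (as it would appear after a CSV round trip) and one missing value.
raw = pd.Series(["{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}", np.nan])

def parse_cell(value):
    """Convert a stringified dict back to a dict; leave non-strings (e.g. NaN) alone."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value

parsed = raw.apply(parse_cell)
print(type(parsed[0]))
```

Parsing the whole column once up front avoids calling `ast.literal_eval` again for every key lookup.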


#### Extended 'lambda' function
https://stackoverflow.com/questions/48872234/using-apply-in-pandas-lambda-functions-with-multiple-if-statements?noredirect=1&lq=1

##### Version if reading in *.csv file

In [85]:
"""
def func(x, key1, key2):
    # Requires `import ast`; cells read from a CSV hold the dict as a string.
    if pd.isnull(x):
        return np.nan
    mydict = ast.literal_eval(x)  # parse the string once instead of on every lookup
    if key1 in mydict:
        return mydict[key1]
    elif key2 in mydict:
        return mydict[key2]
    else:
        return np.nan
"""

##### Version if reading in *.pkl file

In [22]:
"""def func(x, key1, key2):
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan
"""
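
The two versions above differ only in whether a cell holds a real dictionary (pickle input) or its string representation (CSV input). A minimal sketch of a single helper that covers both cases is shown below. The name `get_first_key` is hypothetical and not part of this notebook, and the sketch assumes that any string cell is a valid Python dict literal.

```python
import ast

import numpy as np
import pandas as pd


def get_first_key(x, key1, key2):
    """Return x[key1] if present, else x[key2], else NaN.

    `x` may be a dict (pickle input), a string holding a dict literal
    (CSV input), or a missing value.
    """
    if isinstance(x, str):
        x = ast.literal_eval(x)   # parse once instead of once per lookup
    if not isinstance(x, dict):
        return np.nan             # covers None/NaN and anything unexpected
    if key1 in x:
        return x[key1]
    return x.get(key2, np.nan)


# Toy data: one dict cell, one string cell, one missing cell
headers = pd.Series([
    {'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'},
    "{'TaxPeriodEndDt': '2019-06-30', 'TaxYr': '2018'}",
    np.nan,
])
tax_year = headers.apply(get_first_key, key1='TaxYr', key2='TaxYear')
```

Parsing the string once per cell avoids the repeated `ast.literal_eval` calls in the CSV version above, which may matter on a DataFrame of this size.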

In [23]:
#import timeit
#start_time = timeit.default_timer()
#df['F9_00_HD_TAX_PER_END'] = df['ReturnHeader'][:].apply(func, key1='TaxPeriodEndDt', key2='TaxPeriodEndDate')
##df['F9_09_PC_FEES_FOR_SVCE_MGMT_TOT2'] = df['ReturnHeader'].astype('float')
#elapsed = timeit.default_timer() - start_time
#print('# of minutes: ', elapsed/60) 

# of minutes:  0.2046476249999993


In [24]:
#import timeit
#start_time = timeit.default_timer()
#df['F9_00_HD_TAX_YEAR'] = df['ReturnHeader'][:].apply(func, key1='TaxYr', key2='TaxYear')
##df['F9_09_PC_FEES_FOR_SVCE_MGMT_TOT2'] = df['ReturnHeader'].astype('float')
#elapsed = timeit.default_timer() - start_time
#print('# of minutes: ', elapsed/60) 

# of minutes:  0.19795467166666944


### Handle variables with only 1 original name
NOTE:
- Per *IRS 990 e-File Data -- Control Variables (4) -- Fees-for-Services Variables  - Extract from MongoDB and Process -- Part I (Python 3.6).ipynb*, it looks like there is no *FeesForServicesProfFundraisingGrp*
    - Instead, as seen in the concordance file, *FeesForServicesProfFundraising* has both a 'Total' and a 'TotalAmt' key, which suggests this is the only variable whose name did not change

In [73]:
new_variables_df[new_variables_df['len']!=2]

Unnamed: 0,variable_name_new,original_names,data_type_xsd,len
2,F9_00_HD_BUILD_TIME_STAMP,[BuildTS],,1
7,F9_00_HD_FILER_STATE_US,[Filer],StateType,1
137,F9_09_PC_FEES_FOR_SVCE_FR_TOT,[FeesForServicesProfFundraising],,1
192,TaxPeriod,[TaxPeriod],,1
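
For reference, a *len* column like the one above can be derived from a list-valued column. This is a sketch of one way to do it, not necessarily how *new_variables_df* was built; the name `toy_vars` is made up, and the two toy rows are copied from outputs in this notebook.

```python
import pandas as pd

toy_vars = pd.DataFrame({
    'variable_name_new': ['F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_ADDR_CHANGE'],
    'original_names': [['BuildTS'], ['AddressChange', 'AddressChangeInd']],
})

# Number of original (pre-rename) column names mapped to each new variable
toy_vars['len'] = toy_vars['original_names'].apply(len)

single = toy_vars[toy_vars['len'] != 2]   # handled with a plain rename
paired = toy_vars[toy_vars['len'] == 2]   # handled by combining two columns
```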


#### Rename *FeesForServicesProfFundraising*
Note that *describe* and *value_counts* won't work yet because some values are dictionaries

In [74]:
%%time
df.rename(columns = {'FeesForServicesProfFundraising':'F9_09_PC_FEES_FOR_SVCE_FR_TOT'}, inplace = True)
#df['F9_09_PC_FEES_FOR_SVCE_FR_TOT'].describe()
#df['F9_09_PC_FEES_FOR_SVCE_FR_TOT'].value_counts()[:5]

Wall time: 27.6 s
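
Because dictionaries are not hashable, a frequency table needs a scalar extracted from each cell first. The sketch below pulls the total out of a cell using either the older 'Total' key or the newer 'TotalAmt' key. The helper name `total_from_cell` and the toy values are illustrative assumptions, and the sketch treats every non-dictionary cell as missing.

```python
import numpy as np
import pandas as pd


def total_from_cell(x):
    """Return the 'TotalAmt' (or older 'Total') entry of a dict cell as a float."""
    if isinstance(x, dict):
        value = x.get('TotalAmt', x.get('Total'))
        return float(value) if value is not None else np.nan
    return np.nan  # assumption: non-dict cells are treated as missing


fees = pd.Series([
    {'TotalAmt': '0'},
    {'TotalAmt': '31250', 'FundraisingAmt': '31250'},
    np.nan,
    {'Total': '0'},
])
totals = fees.apply(total_from_cell)
counts = totals.value_counts()  # works once every cell is a scalar
```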


#### Rename *BuildTS*

In [75]:
%%time
df.rename(columns = {'BuildTS':'F9_00_HD_BUILD_TIME_STAMP'}, inplace = True)

Wall time: 11 s


In [82]:
df[['F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'F9_00_HD_BUILD_TIME_STAMP', 'Filer', 'TaxPeriod']].sample(5)

Unnamed: 0,F9_09_PC_FEES_FOR_SVCE_FR_TOT,F9_00_HD_BUILD_TIME_STAMP,Filer,TaxPeriod
1017228,{'TotalAmt': '0'},2017-02-10 21:41:12Z,"{'EIN': '742158707', 'BusinessName': {'BusinessNameLine1Txt': 'NATIONAL ANIMAL CARE & CONTROL', 'BusinessNameLine2Txt': 'ASSOCIATION'}, 'BusinessNameControlTxt': 'NATI', 'PhoneNum': '9137681319', 'USAddress': {'AddressLine1Txt': '515 RUSSELL AVE'...",201512
1031402,"{'TotalAmt': '31250', 'FundraisingAmt': '31250'}",2017-02-10 21:41:12Z,"{'EIN': '030342594', 'BusinessName': {'BusinessNameLine1Txt': 'LAKE CHAMPLAIN COMMUNITY SAILING', 'BusinessNameLine2Txt': 'CENTER INC'}, 'BusinessNameControlTxt': 'LAKE', 'PhoneNum': '8028642499', 'USAddress': {'AddressLine1Txt': 'PO BOX 64818', ...",201512
382975,,2016-02-25 16:41:14Z,"{'EIN': '203345087', 'Name': {'BusinessNameLine1': 'LIONS EYE BANK OF NEW JERSEY INC'}, 'NameControl': 'LION', 'Phone': '7323823060', 'USAddress': {'AddressLine1': '77 BRANT AVE', 'AddressLine2': 'ROOM/SUITE 100', 'City': 'CLARK', 'State': 'NJ', ...",201206
492695,"{'TotalAmt': '44056', 'FundraisingAmt': '44056'}",2015-11-30 17:44:51Z,"{'EIN': '581735540', 'BusinessName': {'BusinessNameLine1': 'WICHITA HABITAT FOR HUMANITY INC'}, 'BusinessNameControlTxt': 'WICH', 'PhoneNum': '3162690755', 'USAddress': {'AddressLine1': '130 E MURDOCK NO 102', 'City': 'WICHITA', 'State': 'KS', 'Z...",201312
901684,,2016-09-27 15:27:22Z,"{'EIN': '232028533', 'BusinessName': {'BusinessNameLine1Txt': 'COVINGTON CHURCH OF CHRIST'}, 'BusinessNameControlTxt': 'COVI', 'PhoneNum': '5706595629', 'USAddress': {'AddressLine1Txt': '2225 N WILLIAMSON ROAD PO BOX 185', 'CityNm': 'COVINGTON', ...",201512


In [86]:
#years = pd.DataFrame(df['F9_00_HD_TAX_YEAR'].value_counts())
years = pd.DataFrame(df['fiscal_year'].value_counts())
years.index.name = 'year'
years = years.reset_index()
years = years.sort_values('year')
years

Unnamed: 0,year,fiscal_year
9,2010,98185
7,2011,139300
6,2012,170761
5,2013,190710
4,2014,210538
3,2015,228000
2,2016,240291
0,2017,251118
1,2018,250237
8,2019,113354


<br>
NOTE: I am not dropping *fiscal_year* yet -- it is different from *F9_00_HD_TAX_YEAR* (*fiscal_year* is the 'year' part of the variable *TaxPeriod*, which is the year-month of *F9_00_HD_TAX_PER_END*).

So, the first three below -- *fiscal_year*,  *TaxPeriod*, and *F9_00_HD_TAX_PER_END* -- are all based off the same date, which is the *END* of the tax period, while *F9_00_HD_TAX_YEAR* reflects the year in which the tax period *BEGINS*.

In [213]:
df[['fiscal_year',  'TaxPeriod', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR']].sample(25)

Unnamed: 0,fiscal_year,TaxPeriod,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR
1858761,2019,201906,2019-06-30,2018
1256596,2016,201612,2016-12-31,2016
226064,2012,201203,2012-03-31,2011
1732876,2019,201906,2019-06-30,2018
740754,2014,201406,2014-06-30,2013
1437744,2017,201712,2017-12-31,2017
181793,2011,201112,2011-12-31,2011
718120,2014,201412,2014-12-31,2014
1735409,2018,201812,2018-12-31,2018
248050,2011,201112,2011-12-31,2011
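
The relationship among the four columns can be reproduced on a few of the sampled dates. This is only a sketch: it assumes a 12-month tax period (so the period begins one year minus one day before it ends), which will not hold for short-period returns, and the lowercase column names are made up for the illustration.

```python
import pandas as pd

toy = pd.DataFrame({'tax_per_end': ['2019-06-30', '2016-12-31', '2012-03-31']})
end = pd.to_datetime(toy['tax_per_end'])

# These two are both derived from the END of the tax period
toy['tax_period'] = end.dt.strftime('%Y%m')   # year-month of the end date
toy['fiscal_year'] = end.dt.year              # year part of the end date

# Assumed 12-month period: it begins the day after the same date one year earlier
begin = end - pd.DateOffset(years=1) + pd.Timedelta(days=1)
toy['tax_year'] = begin.dt.year               # year in which the period BEGINS
```

For a calendar-year filer the two years coincide; they differ only when the period ends before December.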


In [28]:
#df = df.drop('fiscal_year', axis=1)

In [205]:
df[['fiscal_year', 'F9_00_HD_TAX_YEAR']].describe().T

Unnamed: 0,count,unique,top,freq
fiscal_year,1895015,12,2017,251118
F9_00_HD_TAX_YEAR,1895016,11,2017,252085


In [209]:
[c for c in df.columns.tolist() if 'tax' in c.lower()]

['TaxPeriod',
 'F9_00_HD_TAX_PER_END',
 'F9_00_HD_TAX_YEAR',
 'F9_09_PC_PAYROLL_TAX_FUNDRAISE',
 'F9_09_PC_PAYROLL_TAX_MGMT',
 'F9_09_PC_PAYROLL_TAX_PROG_SVCE',
 'F9_09_PC_PAYROLL_TAX_TOTAL']

<br>Drop *ReturnHeader*

In [29]:
#df = df.drop('ReturnHeader', axis=1)

<br>Drop the two variables from *new_variables_df*

In [18]:
#print(len(new_variables_df))
#new_variables_df = new_variables_df[~new_variables_df['variable_name_new'].isin(['F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR'])]
#print(len(new_variables_df))

185
183


In [87]:
df[:1]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,F9_09_PC_FEES_FOR_SVCE_FR_TOT,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,fiscal_year,Timestamp,TaxPeriodEndDate,Filer,Officer,TaxYear,F9_00_HD_BUILD_TIME_STAMP,ReturnTs,TaxPeriodEndDt,BusinessOfficerGrp,TaxYr
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0,0,0,1044925,1439340,0,0,30447,33563,0,1000,1075372,1473903,638637,925000,0,0,0,0,0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,0,0,0,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0,0,0,0,0,0,0,0,1439340,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}",256845,86228,"{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '51640', 'EOY': '240077'}",X,89152,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2010,2011-11-09T06:41:09-06:00,2010-12-31,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...","{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",2010,2016-02-24 21:20:13Z,,,,


##### Save DF

In [88]:
#import timeit
#start_time = timeit.default_timer()
#df.to_pickle('all filings - with 185 newly named control variables.pkl')
#elapsed = timeit.default_timer() - start_time
#print('# of minutes: ', elapsed/60) 

In [12]:
"""
import timeit
start_time = timeit.default_timer()
df = pd.read_pickle('all filings - with 185 newly named control variables.pkl')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]
"""

# of minutes:  1.9289162200000003
# of columns: 318
# of observations: 1727056


Unnamed: 0,AccountantCompileOrReview,ActivityOrMissionDescription,AddressChange,AllAffiliatesIncluded,AllOtherContributions,AmendedReturn,AnnualDisclosureCoveredPersons,AuditCommittee,BenefitsPaidToMembersCY,BenefitsPaidToMembersPriorYear,CashNonInterestBearing,ChangesToOrganizingDocs,CntrbtnsRprtdFundraisingEvents,CompCurrentOfficersDirectors,CompDisqualPersons,CompensationFromOtherSources,CompensationProcessCEO,CompensationProcessOther,ConflictOfInterestPolicy,ContributionsGrantsCurrentYear,ContributionsGrantsPriorYear,CostOfGoodsSold,CountryLegalDomicile,DLN,DecisionsSubjectToApproval,DelegationOfManagementDuties,DoNotFollowSFAS117,DocumentRetentionPolicy,EIN,ElectionOfBoardMembers,FSAudited,FamilyOrBusinessRelationship,FederalGrantAuditPerformed,FederalGrantAuditRequired,FederatedCampaigns,FeesForServicesAccounting,FeesForServicesInvstMgmntFees,FeesForServicesLegal,FeesForServicesLobbying,FeesForServicesManagement,FeesForServicesOther,F9_09_PC_FEES_FOR_SVCE_FR_TOT,FollowSFAS117,Form990ProvidedToGoverningBody,FormersListed,FundraisingDirectExpenses,FundraisingEvents,GamingDirectExpenses,GovernmentGrants,GrantsAndSimilarAmntsCY,GrantsAndSimilarAmntsPriorYear,GrossIncomeFundraisingEvents,GrossIncomeGaming,GrossReceipts,GrossSalesOfInventory,GroupExemptionNumber,GroupReturnForAffiliates,InitialReturn,InvestmentInJointVenture,InvestmentIncomeCurrentYear,InvestmentIncomePriorYear,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasis,LoansFromOfficersDirectors,LocalChapters,MaterialDiversionOrMisuse,MembersOrStockholders,MembershipDues,MethodOfAccountingAccrual,MethodOfAccountingCash,MethodOfAccountingOther,MinutesOfCommittees,MinutesOfGoverningBody,MortNotesPyblSecuredInvestProp,NameOfPrincipalOfficerPerson,NbrIndependentVotingMembers,NbrVotingGoverningBodyMembers,NbrVotingMembersGoverningBody,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,NetUnrelatedBusinessTxblIncome,NoListedPersonsCompensated,NoncashContributions,NumberIndependentVotingMembers,NumberIndividualsGT100K,NumberOfContractorsGT100K,OfficerMailingAddress,Organization4947a1,Organization501c,Organization501c3,OrganizationName,OtherEmployeeBenefits,OtherExpensePriorYear,OtherExpensesCurrentYear,OtherLiabilities,OtherRevenueCurrentYear,OtherRevenuePriorYear,OtherSalariesAndWages,OtherWebsite,OwnWebsite,PayrollTaxes,PensionPlanContributions,PoliciesReferenceChapters,ProgramServiceRevenueCY,ProgramServiceRevenuePriorYear,ReconcilationRevenueExpenses,RegularMonitoringEnforcement,RelatedOrganizations,RetainedEarningsEndowmentEtc,RevenuesLessExpensesCY,RevenuesLessExpensesPriorYear,SalariesEtcCurrentYear,SalariesEtcPriorYear,SavingsAndTempCashInvestments,SpecialConditionDescription,StateLegalDomicile,StatesWhereCopyOfReturnIsFiled,TaxExemptBondLiabilities,TaxPeriod,TerminatedReturn,TerminationOrContraction,TotalAssets,TotalAssetsBOY,TotalAssetsEOY,TotalCompGT150K,TotalContributions,TotalExpensesCurrentYear,TotalExpensesPriorYear,TotalFunctionalExpenses,TotalFundrsngExpCurrentYear,TotalGrossUBI,TotalLiabilitiesBOY,TotalLiabilitiesEOY,TotalNbrEmployees,TotalNbrVolunteers,TotalOtherCompensation,TotalOtherRevenue,TotalProfFundrsngExpCY,TotalProfFundrsngExpPriorYear,TotalProgramServiceRevenue,TotalReportableCompFrmRltdOrgs,TotalReportableCompFromOrg,TotalRevenue,TotalRevenueCurrentYear,TotalRevenuePriorYear,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,TypeOfOrganizationCorporation,TypeOfOrganizationOther,TypeOfOrganizationTrust,URL,UnsecuredNotesLoansPayable,UponRequest,WebSite,WhistleblowerPolicy,WrittenPolicyOrProcedure,YearFormation,ReconcilationDonatedServices,ReconcilationInvestExpenses,ReconcilationPriorAdjustment,ReconciliationUnrealizedInvest,AccountantCompileOrReviewInd,ActivityOrMissionDesc,AddressChangeInd,AllAffiliatesIncludedInd,AllOtherContributionsAmt,AmendedReturnInd,AnnualDisclosureCoveredPrsnInd,AuditCommitteeInd,CYBenefitsPaidToMembersAmt,CYContributionsGrantsAmt,CYGrantsAndSimilarPaidAmt,CYInvestmentIncomeAmt,CYOtherExpensesAmt,CYOtherRevenueAmt,CYProgramServiceRevenueAmt,CYRevenuesLessExpensesAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalExpensesAmt,CYTotalFundraisingExpenseAmt,CYTotalProfFndrsngExpnsAmt,CYTotalRevenueAmt,CashNonInterestBearingGrp,ChangeToOrgDocumentsInd,CntrctRcvdGreaterThan100KCnt,CompCurrentOfcrDirectorsGrp,CompDisqualPersonsGrp,CompensationFromOtherSrcsInd,CompensationProcessCEOInd,CompensationProcessOtherInd,ConflictOfInterestPolicyInd,ContractTerminationInd,ContriRptFundraisingEventAmt,CostOfGoodsSoldAmt,DecisionsSubjectToApprovaInd,DelegationOfMgmtDutiesInd,DocumentRetentionPolicyInd,DonatedServicesAndUseFcltsAmt,ElectionOfBoardMembersInd,FSAuditedInd,FamilyOrBusinessRlnInd,FederalGrantAuditPerformedInd,FederalGrantAuditRequiredInd,FederatedCampaignsAmt,FeesForServicesAccountingGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForServicesManagementGrp,FeesForServicesOtherGrp,FeesForSrvcInvstMgmntFeesGrp,FinalReturnInd,Form990ProvidedToGvrnBodyInd,FormationYr,FormerOfcrEmployeesListedInd,FundraisingAmt,FundraisingDirectExpensesAmt,FundraisingGrossIncomeAmt,GamingDirectExpensesAmt,GamingGrossIncomeAmt,GoverningBodyVotingMembersCnt,GovernmentGrantsAmt,GrossReceiptsAmt,GrossSalesOfInventoryAmt,GroupExemptionNum,GroupReturnForAffiliatesInd,IndependentVotingMemberCnt,IndivRcvdGreaterThan100KCnt,InitialReturnInd,InvestmentExpenseAmt,InvestmentInJointVentureInd,LandBldgEquipAccumDeprecAmt,LandBldgEquipCostOrOtherBssAmt,LegalDomicileCountryCd,LegalDomicileStateCd,LoansFromOfficersDirectorsGrp,LocalChaptersInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,MembershipDuesAmt,MethodOfAccountingAccrualInd,MethodOfAccountingCashInd,MethodOfAccountingOtherInd,MinutesOfCommitteesInd,MinutesOfGoverningBodyInd,MortgNotesPyblScrdInvstPropGrp,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,NetUnrelatedBusTxblIncmAmt,NetUnrlzdGainsLossesInvstAmt,NoListedPersonsCompensatedInd,NoncashContributionsAmt,OfficerMailingAddressInd,OrgDoesNotFollowSFAS117Ind,Organization4947a1NotPFInd,Organization501c3Ind,Organization501cInd,OrganizationFollowsSFAS117Ind,OtherEmployeeBenefitsGrp,OtherLiabilitiesGrp,OtherOrganizationDsc,OtherRevenueTotalAmt,OtherSalariesAndWagesGrp,OtherWebsiteInd,OwnWebsiteInd,PYBenefitsPaidToMembersAmt,PYContributionsGrantsAmt,PYGrantsAndSimilarPaidAmt,PYInvestmentIncomeAmt,PYOtherExpensesAmt,PYOtherRevenueAmt,PYProgramServiceRevenueAmt,PYRevenuesLessExpensesAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalExpensesAmt,PYTotalProfFndrsngExpnsAmt,PYTotalRevenueAmt,PayrollTaxesGrp,PensionPlanContributionsGrp,PoliciesReferenceChaptersInd,PrincipalOfficerNm,PriorPeriodAdjustmentsAmt,ReconcilationRevenueExpnssAmt,RegularMonitoringEnfrcInd,RelatedOrganizationsAmt,RtnEarnEndowmentIncmOthFndsGrp,SavingsAndTempCashInvstGrp,SpecialConditionDesc,StatesWhereCopyOfReturnIsFldCd,TaxExemptBondLiabilitiesGrp,TotReportableCompRltdOrgAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalAssetsGrp,TotalCompGreaterThan150KInd,TotalContributionsAmt,TotalEmployeeCnt,TotalFunctionalExpensesGrp,TotalGrossUBIAmt,TotalLiabilitiesBOYAmt,TotalLiabilitiesEOYAmt,TotalOtherCompensationAmt,TotalProgramServiceRevenueAmt,TotalReportableCompFromOrgAmt,TotalRevenueGrp,TotalVolunteersCnt,TypeOfOrganizationAssocInd,TypeOfOrganizationCorpInd,TypeOfOrganizationOtherInd,TypeOfOrganizationTrustInd,UnsecuredNotesLoansPayableGrp,UponRequestInd,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,WebsiteAddressTxt,WhistleblowerPolicyInd,WrittenPolicyOrProcedureInd,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR
0,0,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,X,,1439340.0,,1,1,0,0.0,,0,,,,0,0,0,1,1439340,1044925.0,,,93493313013011,0,0,,0,232705170,0,1,0,,0,,"{'Total': '21675', 'ManagementAndGeneral': '21675'}",,"{'Total': '215', 'ManagementAndGeneral': '215'}",,,,,X,1,0,,,,,925000,638637.0,,,1473903,,,0,,0,33563,30447,86228,256845,,0,0,0,,X,,,1,1,,MICHAEL ANTON,10,10,10,1753405,1990429,0.0,X,,10,0,0,0,,,X,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,,243131,459751,"{'BOY': '51640', 'EOY': '240077'}",1000,0.0,,,,,,,0,0,89152,1,,,89152,193604,0,0,"{'BOY': '332660', 'EOY': '270700'}",,PA,"[PA, NJ, DE]",,201012,,,"{'BOY': '1925215', 'EOY': '2440859'}",1925215,2440859,0,1439340,1384751,881768,"{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}",195892,0,171810,450430,0,0.0,0,1000,0,0.0,,0,0.0,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}",1473903,1075372,,,X,,,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,,X,,0,,1992,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2010-12-31,2010
1,false,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,,False,,,true,true,0,,"{'BOY': '250', 'EOY': '22261'}",false,,{'Total': '0'},{'Total': '0'},false,true,true,true,0,,,,93493313013111,true,true,,false,581805618,true,true,true,True,true,,"{'Total': '7500', 'ManagementAndGeneral': '7500'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '21600', 'ManagementAndGeneral': '21600'}",{'Total': '0'},{'Total': '0'},X,true,true,,,,,0,,,,266420,,1736.0,false,,false,828,1425,904332,2187206,,false,false,false,,X,,,true,true,"{'BOY': '6219', 'EOY': '7035'}",,13,19,19,1437850,1398765,,,,13,0,0,false,,,X,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,"{'Total': '17714', 'ProgramServices': '17714'}",189785,222550,"{'BOY': '9203', 'EOY': '11349'}",0,,"{'Total': '59440', 'ProgramServices': '59440'}",,,"{'Total': '5801', 'ProgramServices': '5801'}",{'Total': '0'},False,265592,222839,-39085,true,,,-39085,-36926,82955,71405,{'EOY': '0'},,WY,,,201106,,,"{'BOY': '1455332', 'EOY': '1433342'}",1455332,1433342,true,0,305505,261190,"{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}",0,0,17482,34577,0,,411648,0,0,,265592.0,1180355,,"{'TotalRevenueColumn': '266420', 'RelatedOrExemptFunctionIncome': '266420'}",266420,224264,,,X,,,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,,X,,false,False,1993,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011-06-30,2010


# Combine all columns where *len*==2

### Define Function to combine columns
We can define small helper functions to use as shortcuts. First we'll create a function called 'combine' that combines two variables. It takes four *inputs*: our dataset/dataframe (*df*), the name we'd like for our new variable (*newvar*), the name of the first variable to combine (*var1*), and the name of the second variable to combine (*var2*). Where *var1* is non-null its value is used; otherwise the value of *var2* is used.

In [89]:
def combine(df, newvar, var1, var2):
    # Use var1 where it is non-null, otherwise fall back to var2 (adds newvar to df in place)
    df[newvar] = np.where(df[var1].notnull(), df[var1], df[var2])
    #print(df[newvar].value_counts().head(), '\n')
    #print('# of missing observations:', len(df[df[newvar].isnull()]))
    #print('# of valid observations:', len(df[df[newvar].notnull()]), '\n')  
    #return df.sample(5)[[newvar, var1, var2, 'DLN']] 
    #print(df[[newvar, var1, var2, 'ObjectId']][:5], '\n\n\n')
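
A toy run shows what `combine` does when each row is populated in at most one of the two source columns. The function is repeated here so the example runs on its own; the column names `old_name` and `new_name` are placeholders, not columns from the filings data.

```python
import numpy as np
import pandas as pd


def combine(df, newvar, var1, var2):
    # Take var1 where it is present, otherwise fall back to var2
    df[newvar] = np.where(df[var1].notnull(), df[var1], df[var2])


toy = pd.DataFrame({
    'old_name': ['X', None, None],   # populated in older filings only
    'new_name': [None, '1', None],   # populated in newer filings only
})
combine(toy, 'combined', 'old_name', 'new_name')

# Rows 0 and 1 each pick up their single populated value; row 2 stays missing
n_missing = int(toy['combined'].isnull().sum())
```

pandas' `Series.combine_first` expresses the same fallback and could be used instead of `np.where`; whether it is faster on data of this size has not been checked here.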

#### Do initial check to ensure that no row has values in both columns

In [90]:
for index, row in new_variables_df[new_variables_df['len']==2][:].iterrows():
    #print(row['variable_name_new'])
    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    print('\t\t', len(df[df[row['original_names'][0]].notnull()]))
    print('\t\t', len(df[df[row['original_names'][1]].notnull()]))
    #print(len(df[(df[row['original_names'][0]].isnull()) & (df[row['original_names'][1]].isnull())]), '\n\n')      
    print('OK IF ZERO:', len(df[(df[row['original_names'][0]].notnull()) & (df[row['original_names'][1]].notnull())]), '\n\n')

F9_00_HD_ADDR_CHANGE AddressChange AddressChangeInd
		 19701
		 54825
OK IF ZERO: 0 


F9_00_HD_AMENDED_RETURN AmendedReturn AmendedReturnInd
		 4675
		 11967
OK IF ZERO: 0 


F9_00_HD_CTRY_OF_DOMICILE CountryLegalDomicile LegalDomicileCountryCd
		 272
		 749
OK IF ZERO: 0 


F9_00_HD_EXEMPT_STATUS_4847A1 Organization4947a1NotPFInd Organization4947a1
		 777
		 737
OK IF ZERO: 0 


F9_00_HD_EXEMPT_STATUS_501C Organization501cInd Organization501c
		 340575
		 145796
OK IF ZERO: 0 


F9_00_HD_EXEMPT_STATUS_501C3 Organization501c3 Organization501c3Ind
		 348990
		 1058141
OK IF ZERO: 0 


F9_00_HD_FINAL_RETURN FinalReturnInd TerminatedReturn
		 8044
		 2079
OK IF ZERO: 0 


F9_00_HD_GROSS_EXEMPT_NUM GroupExemptionNum GroupExemptionNumber
		 46460
		 19031
OK IF ZERO: 0 


F9_00_HD_GROSS_RCPT GrossReceiptsAmt GrossReceipts
		 1399493
		 495523
OK IF ZERO: 0 


F9_00_HD_GROUP_RETURN GroupReturnForAffiliates GroupReturnForAffiliatesInd
		 495523
		 1399493
OK IF ZERO: 0 


F9_00_HD_INCLUDES_S

		 495523
		 1399493
OK IF ZERO: 0 


F9_06_PC_DOCUMENT_RET_POLICY DocumentRetentionPolicy DocumentRetentionPolicyInd
		 495523
		 1399493
OK IF ZERO: 0 


F9_06_PC_ELECTION_BOARD_MEMBERS ElectionOfBoardMembersInd ElectionOfBoardMembers
		 1399493
		 495523
OK IF ZERO: 0 


F9_06_PC_FAMILY_OR_BUSINESS_REL FamilyOrBusinessRelationship FamilyOrBusinessRlnInd
		 495523
		 1399493
OK IF ZERO: 0 


F9_06_PC_FORM_AVAIL_OWN_WEBSITE OwnWebsiteInd OwnWebsite
		 88817
		 29051
OK IF ZERO: 0 


F9_06_PC_FORM_UPON_REQUEST UponRequest UponRequestInd
		 457550
		 1267577
OK IF ZERO: 0 


F9_06_PC_JOINT_VENTURE_INVESTMNT InvestmentInJointVentureInd InvestmentInJointVenture
		 1399493
		 495523
OK IF ZERO: 0 


F9_06_PC_JOINT_VENTURE_POLICY WrittenPolicyOrProcedure WrittenPolicyOrProcedureInd
		 62690
		 37200
OK IF ZERO: 0 


F9_06_PC_LOCAL_CHAPTERS LocalChaptersInd LocalChapters
		 1399493
		 495523
OK IF ZERO: 0 


F9_06_PC_MATERIAL_DIVERSION MaterialDiversionOrMisuseInd MaterialDiversionOrMisuse
	

		 989244
		 367248
OK IF ZERO: 0 


F9_09_PC_PAYROLL_TAX_FUNDRAISE PayrollTaxes PayrollTaxesGrp
		 360420
		 979744
OK IF ZERO: 0 


F9_09_PC_PAYROLL_TAX_MGMT PayrollTaxes PayrollTaxesGrp
		 360420
		 979744
OK IF ZERO: 0 


F9_09_PC_PAYROLL_TAX_PROG_SVCE PayrollTaxes PayrollTaxesGrp
		 360420
		 979744
OK IF ZERO: 0 


F9_09_PC_PAYROLL_TAX_TOTAL PayrollTaxes PayrollTaxesGrp
		 360420
		 979744
OK IF ZERO: 0 


F9_09_PC_PENSION_CONT_FUNDRAISE PensionPlanContributionsGrp PensionPlanContributions
		 585076
		 230090
OK IF ZERO: 0 


F9_09_PC_PENSION_CONT_MGMT PensionPlanContributionsGrp PensionPlanContributions
		 585076
		 230090
OK IF ZERO: 0 


F9_09_PC_PENSION_CONT_PROG_SVCE PensionPlanContributionsGrp PensionPlanContributions
		 585076
		 230090
OK IF ZERO: 0 


F9_09_PC_PENSION_CONT_TOTAL PensionPlanContributionsGrp PensionPlanContributions
		 585076
		 230090
OK IF ZERO: 0 


F9_09_PC_TOTAL_FUNC_EXPENSES TotalFunctionalExpenses TotalFunctionalExpensesGrp
		 495523
		 1399493
OK I
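
The same overlap check can be done without building filtered copies of the DataFrame, which may matter on roughly 1.9 million rows. A minimal sketch follows; `count_overlap` is a hypothetical helper name, and the toy data deliberately includes one overlapping row to show what a non-zero result looks like.

```python
import pandas as pd


def count_overlap(df, var1, var2):
    """Count rows where BOTH columns are non-null (should be 0 before combining)."""
    return int((df[var1].notnull() & df[var2].notnull()).sum())


toy = pd.DataFrame({
    'AddressChange':    ['X', None, None, 'X'],
    'AddressChangeInd': [None, '1', None, '1'],  # last row deliberately overlaps
})
n_both = count_overlap(toy, 'AddressChange', 'AddressChangeInd')
```

A non-zero count would mean `combine` silently prefers the first column for those rows, so such pairs should be inspected before combining.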

##### Check *F9_07_PC_TOT_REPRT_COMP_RLTD_ORG*

<br>The first run of the code below showed a problem with variable *F9_07_PC_TOT_REPRT_COMP_RLTD_ORG*

But now it seems to be OK.

In [31]:
#for index, row in new_variables_df[new_variables_df['variable_name_new']=='F9_07_PC_TOT_REPRT_COMP_RLTD_ORG'].iterrows():
#    #print(row['variable_name_new'])
#    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
#    print('\t\t', len(df[df[row['original_names'][0]].notnull()]))
#    print('\t\t', len(df[df[row['original_names'][1]].notnull()]))
#    #print(len(df[(df[row['original_names'][0]].isnull()) & (df[row['original_names'][1]].isnull())]), '\n\n')      
#    print('OK IF ZERO:', len(df[(df[row['original_names'][0]].notnull()) & (df[row['original_names'][1]].notnull())]), '\n\n')

F9_07_PC_TOT_REPRT_COMP_RLTD_ORG TotReportableCompRltdOrgAmt TotalReportableCompFrmRltdOrgs
		 663252
		 283814
OK IF ZERO: 0 
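
The 'OK IF ZERO' figure is the number of filings in which both source columns are populated; zero means the two can be merged without one overwriting the other. A minimal sketch of that check on a made-up toy frame (the helper name `count_overlap` and the values are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real filings data (values are invented).
toy = pd.DataFrame({
    'TotReportableCompRltdOrgAmt':    [110000.0, np.nan, np.nan, 48000.0],
    'TotalReportableCompFrmRltdOrgs': [np.nan,   0.0,    np.nan, np.nan],
})

def count_overlap(frame, col_a, col_b):
    """Number of rows where both source columns are non-null."""
    both = frame[col_a].notnull() & frame[col_b].notnull()
    return int(both.sum())

overlap = count_overlap(toy, 'TotReportableCompRltdOrgAmt',
                        'TotalReportableCompFrmRltdOrgs')
print('OK IF ZERO:', overlap)
```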




In [20]:
#['variable_name_new']=='F9_07_PC_TOT_REPRT_COMP_RLTD_ORG']

Unnamed: 0,variable_name_new,original_names,data_type_xsd,len
101,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,"[TotReportableCompRltdOrgAmt, TotalReportableCompFrmRltdOrgs]",USAmountType,2


In [26]:
#df[['TotalReportableCompFrmRltdOrgs']].sample(1)

Unnamed: 0,TotalReportableCompFrmRltdOrgs
2962,


In [22]:
#df[['TotReportableCompRltdOrgAmt']].sample(5)

Unnamed: 0,TotReportableCompRltdOrgAmt
8508,
7210,
7076,0.0
133,
9717,


In [32]:
#df[['TotReportableCompRltdOrgAmt', 'TotalReportableCompFrmRltdOrgs', 'F9_07_PC_TOT_REPRT_COMP_RLTD_ORG']].sample(5)

Unnamed: 0,TotReportableCompRltdOrgAmt,TotalReportableCompFrmRltdOrgs,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG
2567,,,
9645,110000.0,,110000.0
7353,48000.0,,48000.0
7365,,0.0,0.0
6520,,,


In [30]:
#print(len(df[df['TotalReportableCompFrmRltdOrgs'].notnull()]))
#print(len(df[df['TotalReportableCompFrmRltdOrgs'].isnull()]))

283814
1443242


In [92]:
print(len(df[df['TotReportableCompRltdOrgAmt'].notnull()]))
print(len(df[df['TotReportableCompRltdOrgAmt'].isnull()]))

765562
1129454


### Combine

In [93]:
combo_fails = []
for index, row in new_variables_df[new_variables_df['len']==2][:].iterrows():
    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    try:
        combine(df, row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    except Exception as e:
        # Record the failure and keep going so one bad variable does not stop the loop
        print('\n\n\n\n\n***********issue with variable: ', row['variable_name_new'], repr(e))
        combo_fails.append(row['variable_name_new'])

print(combo_fails)

F9_00_HD_ADDR_CHANGE AddressChange AddressChangeInd
F9_00_HD_AMENDED_RETURN AmendedReturn AmendedReturnInd
F9_00_HD_CTRY_OF_DOMICILE CountryLegalDomicile LegalDomicileCountryCd
F9_00_HD_EXEMPT_STATUS_4847A1 Organization4947a1NotPFInd Organization4947a1
F9_00_HD_EXEMPT_STATUS_501C Organization501cInd Organization501c
F9_00_HD_EXEMPT_STATUS_501C3 Organization501c3 Organization501c3Ind
F9_00_HD_FINAL_RETURN FinalReturnInd TerminatedReturn
F9_00_HD_GROSS_EXEMPT_NUM GroupExemptionNum GroupExemptionNumber
F9_00_HD_GROSS_RCPT GrossReceiptsAmt GrossReceipts
F9_00_HD_GROUP_RETURN GroupReturnForAffiliates GroupReturnForAffiliatesInd
F9_00_HD_INCLUDES_SUBORD_ORGS AllAffiliatesIncluded AllAffiliatesIncludedInd
F9_00_HD_INITIAL_RETURN InitialReturn InitialReturnInd
F9_00_HD_PRIN_OFF_NAME PrincipalOfficerNm NameOfPrincipalOfficerPerson
F9_00_HD_SIGNING_OFFICER_SIGNTR BusinessOfficerGrp Officer
F9_00_HD_SPECIAL_CONDITION_DESC SpecialConditionDesc SpecialConditionDescription
F9_00_HD_STATE_OF_DOMICILE

F9_08_PC_FUNDRAISING_EVENTS FundraisingEvents FundraisingAmt
F9_08_PC_FUNDRAISING_GROSS_INC GrossIncomeFundraisingEvents FundraisingGrossIncomeAmt
F9_08_PC_GAMING_DIRECT_EXPENSES GamingDirectExpenses GamingDirectExpensesAmt
F9_08_PC_GAMING_GROSS_INCOME GamingGrossIncomeAmt GrossIncomeGaming
F9_08_PC_GOVERNMENT_GRANTS GovernmentGrants GovernmentGrantsAmt
F9_08_PC_GROSS_SALES_INVENTORY GrossSalesOfInventoryAmt GrossSalesOfInventory
F9_08_PC_MEMBERSHIP_DUES MembershipDuesAmt MembershipDues
F9_08_PC_NONCASH_CONTRIBUTIONS NoncashContributions NoncashContributionsAmt
F9_08_PC_PROGRAM_SVCE_REV_TOTAL TotalProgramServiceRevenue TotalProgramServiceRevenueAmt
F9_08_PC_RELATED_ORGANIZATIONS RelatedOrganizationsAmt RelatedOrganizations
F9_08_PC_TOTAL_CONTRIBUTIONS TotalContributionsAmt TotalContributions
F9_08_PC_TOTAL_OTHER_REVENUE TotalOtherRevenue OtherRevenueTotalAmt
F9_08_PC_TOTAL_PROG_SVCE_REVENUE TotalProgramServiceRevenue TotalProgramServiceRevenueAmt
F9_08_PC_TOTAL_REVENUE TotalRevenueGrp 
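
The `combine` function is defined earlier in the notebook and is not shown in this section. As a rough guess at equivalent behaviour, assuming it fills the new column from the first source and falls back to the second, a sketch using `Series.combine_first` might look like this (`combine_sketch` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

def combine_sketch(frame, new_name, first, second):
    """Hypothetical stand-in for combine(): use `first`, fall back to `second`."""
    frame[new_name] = frame[first].combine_first(frame[second])
    return frame

# Toy data: each row has at most one of the two source columns populated.
toy = pd.DataFrame({
    'GrossReceiptsAmt': [100.0, np.nan, np.nan],
    'GrossReceipts':    [np.nan, 250.0, np.nan],
})
combine_sketch(toy, 'F9_00_HD_GROSS_RCPT', 'GrossReceiptsAmt', 'GrossReceipts')
print(toy)
```

Because the overlap check above returned zero for every pair, the order of the two source columns should not matter here.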

#### Save DF

In [34]:
#import timeit
#start_time = timeit.default_timer()
#df.to_pickle('all filings - with 185 newly named control variables.pkl')
#elapsed = timeit.default_timer() - start_time
#print('# of minutes: ', elapsed/60) 

# of minutes:  4.553268258333325


In [94]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (renamed).pkl.gz', compression='gzip')

Wall time: 1h 8min 28s


### Binarize

In [171]:
binarize_cols = new_variables_df[new_variables_df['data_type_xsd'].isin(['BooleanType', 'CheckboxType'])]['variable_name_new'].tolist()
print(len(binarize_cols))
print(binarize_cols)

58
['F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_01_PC_TERMINATION_CONTRACTION', 'F9_04_PC_FR_EVENT_INC_GT_15K', 'F9_04_PC_GAMING_INC_GT_15K', 'F9_04_PC_PROF_FR_EXP_GT_15K', 'F9_06_PC_990_PROVIDED_GOV_BODY', 'F9_06_PC_ANNUAL_DISC_COVRD_PERS', 'F9_06_PC_CEO_COMPENSTN_PROCESS', 'F9_06_PC_CHANGES_ORGANIZING_DOCS', 'F9_06_PC_CONFLICT_OF_INTEREST', 'F9_06_PC_DECISIONS_SUBJ_APPROVAL', 'F9_06_PC_DELEGATION_MGT_DUTIES', 'F9_06_PC_DELEGATION_OF_MGT', 'F9_06_PC_DOCUMENT_RET_POLICY', 'F9_06_PC_ELECTION_BOARD_MEMBERS', 'F9_06_PC_FAMILY_OR_BUSINESS_REL', 'F9_06_PC_FORM_AVAIL_OWN_WEBSITE', 'F9_06_PC_FORM_UPON_REQUEST', 'F9_06_PC_JOINT_VENTURE_INVESTMNT', 'F9_06_PC_JOINT_VENTUR

In [96]:
for c in binarize_cols[:]:
    print(df[df[c].notnull()][c].value_counts().head(), '\n')

X    74526
Name: F9_00_HD_ADDR_CHANGE, dtype: int64 

X    16642
Name: F9_00_HD_AMENDED_RETURN, dtype: int64 

X    1514
Name: F9_00_HD_EXEMPT_STATUS_4847A1, dtype: int64 

{'@organization501cTypeTxt': '6', '#text': 'X'}    98981
{'@organization501cTypeTxt': '4', '#text': 'X'}    50870
{'@organization501cTypeTxt': '5', '#text': 'X'}    43146
{'@organization501cTypeTxt': '7', '#text': 'X'}    40186
{'@typeOf501cOrganization': '6', '#text': 'X'}     32959
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64 

X    1407131
Name: F9_00_HD_EXEMPT_STATUS_501C3, dtype: int64 

X    10123
Name: F9_00_HD_FINAL_RETURN, dtype: int64 

false    1112425
0         778013
true        2360
1           2218
Name: F9_00_HD_GROUP_RETURN, dtype: int64 

false                                                         283830
1                                                               7243
true                                                            5132
{'@referenceDocumentId': '', '#text': 'false'}        

##### BASED ON THE PRECEDING CODE BLOCK, THE FOLLOWING VARIABLES WILL LIKELY FAIL THE BINARIZATION PROCESS:

In [None]:
F9_00_HD_EXEMPT_STATUS_501C
F9_00_HD_INCLUDES_SUBORD_ORGS
F9_04_PC_FR_EVENT_INC_GT_15K
F9_04_PC_GAMING_INC_GT_15K
F9_04_PC_PROF_FR_EXP_GT_15K
F9_12_PC_ACCTG_METHOD_OTHER

- F9_12_PC_ACCTG_METHOD_OTHER
    - some values are 'X' and some are dictionaries --> depending on the number of distinct values, either use *np.where* or flatten the dictionaries
    - {'#text': 'X', '@methodOfAccountingOtherDesc': 'MODIFIED CASH'}          7120
    - X                                                                        3197
    - {'@note': 'MODIFIED CASH', '#text': 'X'}                                 3117
    - {'#text': 'X', '@methodOfAccountingOtherDesc': 'Modified Cash'}          1715
    - {'#text': 'X', '@methodOfAccountingOtherDesc': 'MODIFIED CASH BASIS'}    1033
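
One way to handle a column that mixes plain 'X' strings with dictionaries is to look only at the `'#text'` entry of each dictionary. The sketch below uses an invented four-row series; the helper name `is_checked` is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

# Toy cells mirroring the value shapes listed above.
raw = pd.Series([
    'X',
    {'#text': 'X', '@methodOfAccountingOtherDesc': 'MODIFIED CASH'},
    {'@note': 'MODIFIED CASH', '#text': 'X'},
    np.nan,
])

def is_checked(value):
    """1 if the cell is 'X' or a dict whose '#text' is 'X'; 0 otherwise; NaN stays NaN."""
    if isinstance(value, dict):
        return int(value.get('#text') == 'X')
    if pd.isnull(value):
        return np.nan
    return int(value == 'X')

flags = raw.apply(is_checked)
print(flags)
```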

In [163]:
binarize_with_dict_cols = ['F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 
                           'F9_04_PC_FR_EVENT_INC_GT_15K', 'F9_04_PC_GAMING_INC_GT_15K',
                           'F9_04_PC_PROF_FR_EXP_GT_15K', 'F9_12_PC_ACCTG_METHOD_OTHER']

In [106]:
for c in binarize_with_dict_cols[:]:
    print(df[df[c].notnull()][c].value_counts()[:10], '\n')

{'@organization501cTypeTxt': '6', '#text': 'X'}     98981
{'@organization501cTypeTxt': '4', '#text': 'X'}     50870
{'@organization501cTypeTxt': '5', '#text': 'X'}     43146
{'@organization501cTypeTxt': '7', '#text': 'X'}     40186
{'@typeOf501cOrganization': '6', '#text': 'X'}      32959
{'@typeOf501cOrganization': '3', '#text': 'X'}      28339
{'@organization501cTypeTxt': '9', '#text': 'X'}     22467
{'@typeOf501cOrganization': '4', '#text': 'X'}      16427
{'@organization501cTypeTxt': '8', '#text': 'X'}     16414
{'@organization501cTypeTxt': '19', '#text': 'X'}    16023
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64 

false                                                             283830
1                                                                   7243
true                                                                5132
{'@referenceDocumentId': '', '#text': 'false'}                       462
{'@referenceDocumentId': 'RetDoc2030100001', '#text': '0'}           429
0    

In [98]:
print(len(binarize_cols))
binarize_cols = list(set(binarize_cols) - set(binarize_with_dict_cols))
print(len(binarize_cols))

58
52
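
Note that `list(set(...))` removes the six columns but does not preserve the original column order. If the order matters for later inspection, an order-preserving filter is one alternative. A small sketch with a shortened, illustrative column list (the `_demo` names are not in the notebook):

```python
# Shortened stand-ins for binarize_cols and binarize_with_dict_cols.
binarize_cols_demo = [
    'F9_00_HD_ADDR_CHANGE',
    'F9_00_HD_EXEMPT_STATUS_501C',
    'F9_00_HD_GROUP_RETURN',
    'F9_12_PC_ACCTG_METHOD_OTHER',
]
dict_cols_demo = {'F9_00_HD_EXEMPT_STATUS_501C', 'F9_12_PC_ACCTG_METHOD_OTHER'}

# Keep every column that is not in the exclusion set, in its original position.
kept = [c for c in binarize_cols_demo if c not in dict_cols_demo]
print(kept)
```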


##### Check *F9_12_PC_ACCTG_METHOD_OTHER*, *F9_00_HD_EXEMPT_STATUS_501C*, and *F9_00_HD_INCLUDES_SUBORD_ORGS*
Based on the following frequencies, for *F9_12_PC_ACCTG_METHOD_OTHER* use *np.where* to recode it as 'other'. Leave *F9_00_HD_EXEMPT_STATUS_501C* and *F9_00_HD_INCLUDES_SUBORD_ORGS* alone.
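
As an illustration of the *np.where* idea, the sketch below collapses every non-null entry to the single label 'other' and leaves missing values missing. The toy series and the name `collapsed` are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Toy values resembling the free-text descriptions in the real column.
acct = pd.Series(['MODIFIED CASH', 'X', 'Modified Cash', np.nan, 'HYBRID'],
                 dtype=object)

# Non-null cells become 'other'; null cells keep their original (NaN) value.
collapsed = pd.Series(np.where(acct.notnull(), 'other', acct), index=acct.index)
print(collapsed.value_counts(dropna=False))
```

Passing the original series as the third argument, rather than `np.nan`, keeps the result as an object array instead of turning missing values into the string 'nan'.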

In [105]:
print(df[df['F9_00_HD_EXEMPT_STATUS_501C'].notnull()]['F9_00_HD_EXEMPT_STATUS_501C'].value_counts().head())

{'@organization501cTypeTxt': '6', '#text': 'X'}    98981
{'@organization501cTypeTxt': '4', '#text': 'X'}    50870
{'@organization501cTypeTxt': '5', '#text': 'X'}    43146
{'@organization501cTypeTxt': '7', '#text': 'X'}    40186
{'@typeOf501cOrganization': '6', '#text': 'X'}     32959
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64


#### Fix *F9_00_HD_EXEMPT_STATUS_501C*

In [None]:
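def func(x, key1, key2):
    """Return the value stored under key1 (or, failing that, key2) of a dict cell."""
    # Assumes every non-null cell is a dict, as in F9_00_HD_EXEMPT_STATUS_501C
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    elif key1 in x.keys():
        return x[key1]
    elif key2 in x.keys():
        return x[key2]
    else:
        return np.nan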

In [113]:
df['F9_00_HD_EXEMPT_STATUS_501C'] = df['F9_00_HD_EXEMPT_STATUS_501C'][:].apply(func, 
                            key1='@organization501cTypeTxt', key2 ='@typeOf501cOrganization')

In [114]:
#df = df.drop('Organization501c_type', axis=1)

In [117]:
df['F9_00_HD_EXEMPT_STATUS_501C'].value_counts()[:10]

6     131940
4      67297
5      57680
7      53636
9      32535
3      28339
8      21452
19     20858
12     19202
14     18970
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64

#### Fix other five variables

In [119]:
binarize_with_dict_cols

['F9_00_HD_INCLUDES_SUBORD_ORGS',
 'F9_04_PC_FR_EVENT_INC_GT_15K',
 'F9_04_PC_GAMING_INC_GT_15K',
 'F9_04_PC_PROF_FR_EXP_GT_15K',
 'F9_12_PC_ACCTG_METHOD_OTHER']

In [121]:
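def func_text(x, key1):
    """Return x[key1] for dict cells; pass plain (non-dict) values through unchanged."""
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    
    elif isinstance(x, dict): 
        # A dict without key1 yields NaN rather than an implicit None
        return x.get(key1, np.nan)
    else:
        return x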

In [141]:
df['F9_00_HD_INCLUDES_SUBORD_ORGS'] = df['F9_00_HD_INCLUDES_SUBORD_ORGS'][:].apply(func_text, 
                            key1='#text')

In [142]:
#df = df.drop('test', axis=1)

In [144]:
df['F9_00_HD_INCLUDES_SUBORD_ORGS'].value_counts()

false    284676
1          7243
true       5345
0           640
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64

In [146]:
#binarize_with_dict_cols

['F9_04_PC_FR_EVENT_INC_GT_15K',
 'F9_04_PC_GAMING_INC_GT_15K',
 'F9_04_PC_PROF_FR_EXP_GT_15K',
 'F9_12_PC_ACCTG_METHOD_OTHER']

In [148]:
df['F9_04_PC_FR_EVENT_INC_GT_15K'] = df['F9_04_PC_FR_EVENT_INC_GT_15K'][:].apply(func_text, 
                            key1='#text')

In [149]:
df['F9_04_PC_FR_EVENT_INC_GT_15K'].value_counts()

false    863971
0        587063
true     250814
1        193168
Name: F9_04_PC_FR_EVENT_INC_GT_15K, dtype: int64

In [147]:
for c in binarize_with_dict_cols[:]:
    print(df[df[c].notnull()][c].value_counts()[:10], '\n')

false                                                           841251
0                                                               571809
{'@referenceDocumentId': 'RetDoc1041300001', '#text': '1'}      136243
{'@referenceDocumentId': 'IRS990ScheduleG', '#text': 'true'}     88991
{'@referenceDocumentId': '990G', '#text': 'true'}                35786
{'@referenceDocumentId': 'RetDoc5', '#text': 'true'}             30266
1                                                                26762
{'@referenceDocumentId': 'RetDoc4', '#text': 'true'}             24097
{'@referenceDocumentId': 'RetDoc1041200001', '#text': '1'}       21590
{'@referenceDocumentId': 'RetDoc6', '#text': 'true'}             14389
Name: F9_04_PC_FR_EVENT_INC_GT_15K, dtype: int64 

false                                                            891322
0                                                                603964
{'@referenceDocumentId': 'RetDoc1041300001', '#text': '0'}       130262
{'@referenceDocumentId'

In [150]:
df['F9_04_PC_GAMING_INC_GT_15K'] = df['F9_04_PC_GAMING_INC_GT_15K'][:].apply(func_text, 
                            key1='#text')

In [152]:
df['F9_04_PC_GAMING_INC_GT_15K'].value_counts()

false    1073662
0         764352
true       41123
1          15879
Name: F9_04_PC_GAMING_INC_GT_15K, dtype: int64

In [153]:
df['F9_04_PC_PROF_FR_EXP_GT_15K'] = df['F9_04_PC_PROF_FR_EXP_GT_15K'][:].apply(func_text, 
                            key1='#text')

In [154]:
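# NOTE: this re-displays F9_04_PC_FR_EVENT_INC_GT_15K; the column fixed in the previous cell is F9_04_PC_PROF_FR_EXP_GT_15K
df['F9_04_PC_FR_EVENT_INC_GT_15K'].value_counts()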

false    863971
0        587063
true     250814
1        193168
Name: F9_04_PC_FR_EVENT_INC_GT_15K, dtype: int64

#### Fix *F9_12_PC_ACCTG_METHOD_OTHER*

In [157]:
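def func_text2(x, key1, key2):
    """Return x[key1] (or, failing that, x[key2]) for dict cells; pass plain values through."""
    if pd.isnull(x):
        return np.nan
    #else: 
    #    mydict = ast.literal_eval(x)
    
    elif isinstance(x, dict): 
        if key1 in x:
            return x[key1]
        elif key2 in x:
            return x[key2]
        return np.nan  # dict with neither key
    else:
        return x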

In [160]:
df['F9_12_PC_ACCTG_METHOD_OTHER'] = df['F9_12_PC_ACCTG_METHOD_OTHER'][:].apply(func_text2, 
                            key1='@note', key2='@methodOfAccountingOtherDesc')

In [161]:
#df = df.drop('test', axis=1)

In [162]:
df['F9_12_PC_ACCTG_METHOD_OTHER'].value_counts()[:10]

MODIFIED CASH          15005
X                       4126
Modified Cash           3656
MODIFIED CASH BASIS     2062
modified cash           1109
HYBRID                  1090
Modified cash            816
Modified Cash Basis      789
MODIFIED ACCRUAL         736
MODIFIED CAS             674
Name: F9_12_PC_ACCTG_METHOD_OTHER, dtype: int64

##### Remove two variables from *binarize_cols*

In [172]:
print(len(binarize_cols))
binarize_cols.remove('F9_12_PC_ACCTG_METHOD_OTHER') 
binarize_cols.remove('F9_00_HD_EXEMPT_STATUS_501C')
print(len(binarize_cols))

58
56


#### Create *501c3* variable

In [174]:
df['F9_00_HD_EXEMPT_STATUS_501C'].value_counts()[:10]

6     131940
4      67297
5      57680
7      53636
9      32535
3      28339
8      21452
19     20858
12     19202
14     18970
Name: F9_00_HD_EXEMPT_STATUS_501C, dtype: int64

In [175]:
df['F9_00_HD_EXEMPT_STATUS_501C3'].value_counts()

X    1407131
Name: F9_00_HD_EXEMPT_STATUS_501C3, dtype: int64

In [167]:
pd.crosstab(df['F9_00_HD_EXEMPT_STATUS_501C3'], df['F9_00_HD_EXEMPT_STATUS_501C'])

In [176]:
1407131+28339

1435470

In [177]:
df['501c3'] = np.where(df['F9_00_HD_EXEMPT_STATUS_501C3']=='X', 1, 0)
print(df['501c3'].value_counts(),'\n')
print(348990+1058141)
df['501c3'] = np.where(df['F9_00_HD_EXEMPT_STATUS_501C']=='3', 1, df['501c3'])
print(df['501c3'].value_counts())

1    1407131
0     487885
Name: 501c3, dtype: int64 

1407131
1    1435470
0     459546
Name: 501c3, dtype: int64
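
The two-step *np.where* above amounts to a logical OR of the two conditions. A compact sketch of the same idea on an invented four-row frame (the toy data is an assumption; the column names are the notebook's):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'F9_00_HD_EXEMPT_STATUS_501C3': ['X', np.nan, np.nan, 'X'],
    'F9_00_HD_EXEMPT_STATUS_501C':  [np.nan, '3', '6', np.nan],
})

# A filing counts as 501(c)(3) if the 501c3 box is checked OR the 501c type is '3'.
is_c3 = (toy['F9_00_HD_EXEMPT_STATUS_501C3'] == 'X') | \
        (toy['F9_00_HD_EXEMPT_STATUS_501C'] == '3')
toy['501c3'] = is_c3.astype(int)
print(toy['501c3'].value_counts())
```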


#### Save DF

In [179]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (renamed).pkl.gz', compression='gzip')

Wall time: 1h 24min 22s


# Binarize Columns

In [180]:
for col in binarize_cols:
    print(df[col].value_counts(), '\n\n')

X    74526
Name: F9_00_HD_ADDR_CHANGE, dtype: int64 


X    16642
Name: F9_00_HD_AMENDED_RETURN, dtype: int64 


X    1514
Name: F9_00_HD_EXEMPT_STATUS_4847A1, dtype: int64 


X    1407131
Name: F9_00_HD_EXEMPT_STATUS_501C3, dtype: int64 


X    10123
Name: F9_00_HD_FINAL_RETURN, dtype: int64 


false    1112425
0         778013
true        2360
1           2218
Name: F9_00_HD_GROUP_RETURN, dtype: int64 


false    284676
1          7243
true       5345
0           640
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64 


X    18113
Name: F9_00_HD_INITIAL_RETURN, dtype: int64 


X    83569
Name: F9_00_HD_TYPE_ORG_ASSOCIATION, dtype: int64 


X    1666981
Name: F9_00_HD_TYPE_ORG_CORP, dtype: int64 


X    45946
Name: F9_00_HD_TYPE_ORG_OTHER, dtype: int64 


X    60917
Name: F9_00_HD_TYPE_ORG_TRUST, dtype: int64 


X    13530
Name: F9_01_PC_TERMINATION_CONTRACTION, dtype: int64 


false    863971
0        587063
true     250814
1        193168
Name: F9_04_PC_FR_EVENT_INC_GT_15K, dtype: in

In [181]:
print(len(binarize_cols))

56


In [182]:
def binarize(df, variable):
    """Recode 'true'/'1'/'X' as 1 and 'false'/'0' as 0 in place; other values (including NaN) are left as-is."""
    print(df[variable].value_counts(), '\n')    # counts before recoding
    df[variable] = np.where(df[variable]=='true', 1, df[variable])
    df[variable] = np.where(df[variable]=='false', 0, df[variable])
    df[variable] = np.where(df[variable]=='1', 1, df[variable])
    df[variable] = np.where(df[variable]=='0', 0, df[variable])
    df[variable] = np.where(df[variable]=='X', 1, df[variable])
    print(df[variable].value_counts(), '\n\n')  # counts after recoding
    return df.sample(10)[['EIN', variable]]
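
The same recoding can be expressed as a single lookup table. The sketch below is an alternative, not the notebook's method; `BINARY_MAP` and `binarize_sketch` are hypothetical names, and it assumes the cells are plain strings or NaN (dictionary cells would need flattening first, since they cannot be used as lookup keys):

```python
import numpy as np
import pandas as pd

# Lookup table covering the string codes seen in the value counts above.
BINARY_MAP = {'true': 1, 'false': 0, '1': 1, '0': 0, 'X': 1}

def binarize_sketch(series):
    """Map known codes to 0/1; leave unrecognised values and NaN untouched."""
    return series.map(lambda v: BINARY_MAP.get(v, v))

toy = pd.Series(['true', 'false', '1', '0', 'X', np.nan, 'maybe'], dtype=object)
recoded = binarize_sketch(toy)
print(recoded.tolist())
```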

In [183]:
for col in binarize_cols:
    binarize(df, col)

X    74526
Name: F9_00_HD_ADDR_CHANGE, dtype: int64 

1    74526
Name: F9_00_HD_ADDR_CHANGE, dtype: int64 


X    16642
Name: F9_00_HD_AMENDED_RETURN, dtype: int64 

1    16642
Name: F9_00_HD_AMENDED_RETURN, dtype: int64 


X    1514
Name: F9_00_HD_EXEMPT_STATUS_4847A1, dtype: int64 

1    1514
Name: F9_00_HD_EXEMPT_STATUS_4847A1, dtype: int64 


X    1407131
Name: F9_00_HD_EXEMPT_STATUS_501C3, dtype: int64 

1    1407131
Name: F9_00_HD_EXEMPT_STATUS_501C3, dtype: int64 


X    10123
Name: F9_00_HD_FINAL_RETURN, dtype: int64 

1    10123
Name: F9_00_HD_FINAL_RETURN, dtype: int64 


false    1112425
0         778013
true        2360
1           2218
Name: F9_00_HD_GROUP_RETURN, dtype: int64 

0    1890438
1       4578
Name: F9_00_HD_GROUP_RETURN, dtype: int64 


false    284676
1          7243
true       5345
0           640
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64 

0    285316
1     12588
Name: F9_00_HD_INCLUDES_SUBORD_ORGS, dtype: int64 


X    18113
Name: F9_00_HD_INITIAL_R

1    426692
Name: F9_10_PC_ORG_NOT_FOLLOW_SFAS117, dtype: int64 


false    933125
0        713485
true     181660
1         66746
Name: F9_12_PC_ACCNT_COMPILE_OR_REVIEW, dtype: int64 

0    1646610
1     248406
Name: F9_12_PC_ACCNT_COMPILE_OR_REVIEW, dtype: int64 


X    1263945
Name: F9_12_PC_ACCTG_METHOD_ACCRUAL, dtype: int64 

1    1263945
Name: F9_12_PC_ACCTG_METHOD_ACCRUAL, dtype: int64 


X    591946
Name: F9_12_PC_ACCTG_METHOD_CASH, dtype: int64 

1    591946
Name: F9_12_PC_ACCTG_METHOD_CASH, dtype: int64 


1        458980
true     445920
false    132225
0         86728
Name: F9_12_PC_AUDIT_COMMITTEE, dtype: int64 

1    904900
0    218953
Name: F9_12_PC_AUDIT_COMMITTEE, dtype: int64 


1        99466
true     81735
false    52656
0          631
Name: F9_12_PC_FED_GRNT_AUDIT_PERFORMD, dtype: int64 

1    181201
0     53287
Name: F9_12_PC_FED_GRNT_AUDIT_PERFORMD, dtype: int64 


false    856280
0        680157
1         99977
true      81907
Name: F9_12_PC_FED_GRNT_AUDIT_REQUIR

In [185]:
df[binarize_cols].sample(10)

Unnamed: 0,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_TRUST,F9_01_PC_TERMINATION_CONTRACTION,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED
1179867,,,,1,,0,,,,1.0,,,,0,0,0,1,,1,0,0,0,0,0,0,0,1,,1,0,,0,0,0,1,1,,0,1,,,,0,0,1,,0,1.0,,0,,1.0,,,0,0
643125,,,,1,,0,,,,,1.0,,,0,0,0,1,1.0,1,0,1,0,0,0,0,0,0,,1,0,,0,0,0,0,1,0.0,0,0,,,,0,0,0,,0,1.0,,0,1.0,,1.0,1.0,1,1
894438,,,,1,,0,,,,1.0,,,,1,0,0,1,,1,0,0,0,0,0,0,0,0,,1,0,,0,0,0,1,1,,0,1,,,,0,0,0,,0,1.0,,0,,1.0,,,0,0
1886060,,,,1,,0,0.0,1.0,,1.0,,,,0,0,0,1,0.0,0,0,0,0,1,1,0,0,0,,1,0,,0,0,0,0,1,0.0,0,0,,,,0,0,0,1.0,0,1.0,,0,1.0,,,,0,0
1679603,,,,1,,0,,,,1.0,,,,0,0,0,0,0.0,1,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1.0,0,0,,,,1,0,0,,0,1.0,,0,1.0,,0.0,,0,1
1182054,,1.0,,1,,0,,,,1.0,,,,0,0,0,1,,0,0,0,0,0,0,0,0,0,,1,0,,0,0,0,1,1,,0,0,,,,0,0,0,,0,1.0,,0,,1.0,,,0,0
1808096,,,,1,,0,,,,1.0,,,,0,0,0,1,1.0,1,0,1,0,0,0,0,0,0,1.0,1,0,,0,0,0,0,1,1.0,0,0,1.0,1.0,,1,0,0,,0,,,0,,1.0,,,0,0
805562,,,,1,,0,,,,,1.0,,,0,0,0,1,1.0,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1.0,0,0,,,,1,0,0,1.0,0,1.0,,1,1.0,,1.0,,0,0
505590,,,,1,,0,,,,1.0,,,,1,0,0,1,1.0,1,0,1,0,0,0,1,0,0,,1,0,,0,0,0,1,1,1.0,0,0,1.0,,,0,0,0,1.0,0,,1.0,0,1.0,,,,0,0
513674,1.0,,,1,,0,,,,1.0,,,,0,0,0,1,1.0,1,0,1,0,0,0,1,0,0,,1,0,,1,0,0,1,1,1.0,0,0,,,1.0,1,0,0,,0,1.0,,0,1.0,,,,0,0


### Check that the number of non-null values in each new variable equals the sum across the two prior variables

In [190]:
for index, row in new_variables_df[new_variables_df['len']==2][:].iterrows():
    #print(row['variable_name_new'])
    print(row['variable_name_new'], row['original_names'][0], row['original_names'][1])
    print(len(df[df[row['original_names'][0]].notnull()]) + len(df[df[row['original_names'][1]].notnull()]))    
    print(len(df[df[row['variable_name_new']].notnull()]), '\n')
    #print(len(df[(df[row['original_names'][0]].notnull()) & (df[row['original_names'][1]].notnull())]), '\n')     

F9_00_HD_ADDR_CHANGE AddressChange AddressChangeInd
74526
74526 

F9_00_HD_AMENDED_RETURN AmendedReturn AmendedReturnInd
16642
16642 

F9_00_HD_CTRY_OF_DOMICILE CountryLegalDomicile LegalDomicileCountryCd
1021
1021 

F9_00_HD_EXEMPT_STATUS_4847A1 Organization4947a1NotPFInd Organization4947a1
1514
1514 

F9_00_HD_EXEMPT_STATUS_501C Organization501cInd Organization501c
486371
486371 

F9_00_HD_EXEMPT_STATUS_501C3 Organization501c3 Organization501c3Ind
1407131
1407131 

F9_00_HD_FINAL_RETURN FinalReturnInd TerminatedReturn
10123
10123 

F9_00_HD_GROSS_EXEMPT_NUM GroupExemptionNum GroupExemptionNumber
65491
65491 

F9_00_HD_GROSS_RCPT GrossReceiptsAmt GrossReceipts
1895016
1895016 

F9_00_HD_GROUP_RETURN GroupReturnForAffiliates GroupReturnForAffiliatesInd
1895016
1895016 

F9_00_HD_INCLUDES_SUBORD_ORGS AllAffiliatesIncluded AllAffiliatesIncludedInd
297904
297904 

F9_00_HD_INITIAL_RETURN InitialReturn InitialReturnInd
18113
18113 

F9_00_HD_PRIN_OFF_NAME PrincipalOfficerNm NameOfPrincipal

1895016
1895016 

F9_06_PC_OTHER_COMPENSTN_PROCESS CompensationProcessOther CompensationProcessOtherInd
1882164
1882164 

F9_06_PC_OTHER_WEBSITE OtherWebsite OtherWebsiteInd
251612
251612 

F9_06_PC_OWN_WEBSITE OwnWebsiteInd OwnWebsite
117868
117868 

F9_06_PC_POLICIES_GOVERN_CHAPTER PoliciesReferenceChaptersInd PoliciesReferenceChapters
150852
150852 

F9_06_PC_STATES_WHERE_RET_FILED StatesWhereCopyOfReturnIsFiled StatesWhereCopyOfReturnIsFldCd
967132
967132 

F9_06_PC_WHISTLEBLOWER_POLICY WhistleblowerPolicy WhistleblowerPolicyInd
1895016
1895016 

F9_07_PC_COMPENSATION_OTHER_SRCE CompensationFromOtherSources CompensationFromOtherSrcsInd
1895016
1895016 

F9_07_PC_FORMER_OFFICER_LISTED FormerOfcrEmployeesListedInd FormersListed
1895016
1895016 

F9_07_PC_NO_LISTED_PERS_COMPENSD NoListedPersonsCompensated NoListedPersonsCompensatedInd
805206
805206 

F9_07_PC_NUM_CONTRCTRS_GRTR_100K NumberOfContractorsGT100K CntrctRcvdGreaterThan100KCnt
1190581
1190581 

F9_07_PC_NUM_INDS_GREATER_100K

40632
40632 

F9_11_PC_RECNCLTN_PRIOR_PER_ADJ ReconcilationPriorAdjustment PriorPeriodAdjustmentsAmt
141358
141358 

F9_11_PC_RECNCLTN_REV_LESS_EXP ReconcilationRevenueExpnssAmt ReconcilationRevenueExpenses
1861706
1861706 

F9_11_PC_RECNCLTN_UNRLZD_GAIN ReconciliationUnrealizedInvest NetUnrlzdGainsLossesInvstAmt
401729
401729 

F9_12_PC_ACCNT_COMPILE_OR_REVIEW AccountantCompileOrReview AccountantCompileOrReviewInd
1895016
1895016 

F9_12_PC_ACCTG_METHOD_ACCRUAL MethodOfAccountingAccrualInd MethodOfAccountingAccrual
1263945
1263945 

F9_12_PC_ACCTG_METHOD_CASH MethodOfAccountingCash MethodOfAccountingCashInd
591946
591946 

F9_12_PC_ACCTG_METHOD_OTHER MethodOfAccountingOtherInd MethodOfAccountingOther
39125
39125 

F9_12_PC_AUDIT_COMMITTEE AuditCommittee AuditCommitteeInd
1123853
1123853 

F9_12_PC_FED_GRNT_AUDIT_PERFORMD FederalGrantAuditPerformedInd FederalGrantAuditPerformed
234488
234488 

F9_12_PC_FED_GRNT_AUDIT_REQUIRED FederalGrantAuditRequiredInd FederalGrantAuditRequired
17183

<br><br>
Because the counts match for every pair above, it is safe to delete the 138 original variables behind the 69 combined variables in *variable_name_new* (these numbers are from an earlier version of the notebook).
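
The comparison can also be automated so that mismatches do not have to be spotted by eye. A minimal sketch, assuming a helper named `counts_match` (hypothetical) and a small invented frame:

```python
import numpy as np
import pandas as pd

def counts_match(frame, new_col, col_a, col_b):
    """True when non-null counts of the two sources add up to the combined column's."""
    expected = frame[col_a].notnull().sum() + frame[col_b].notnull().sum()
    return int(expected) == int(frame[new_col].notnull().sum())

toy = pd.DataFrame({
    'FederalGrantAuditRequiredInd':     ['false', np.nan, '0', np.nan],
    'FederalGrantAuditRequired':        [np.nan, 'true', np.nan, np.nan],
    'F9_12_PC_FED_GRNT_AUDIT_REQUIRED': [0, 1, 0, np.nan],
})
ok = counts_match(toy, 'F9_12_PC_FED_GRNT_AUDIT_REQUIRED',
                  'FederalGrantAuditRequiredInd', 'FederalGrantAuditRequired')
print(ok)
```

This equality only holds when the two source columns never overlap, which is what the earlier 'OK IF ZERO' check established.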

### Inspect the Combined and Original Variables
Here I show one variable as an example.

In [191]:
df[df['F9_12_PC_FED_GRNT_AUDIT_REQUIRED'].notnull()].sample(5)[['F9_12_PC_FED_GRNT_AUDIT_REQUIRED', 'FederalGrantAuditRequiredInd', 'FederalGrantAuditRequired']]

Unnamed: 0,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,FederalGrantAuditRequiredInd,FederalGrantAuditRequired
1781328,0,false,
1449422,0,false,
1781643,0,false,
1518949,0,0,
1850438,0,false,


### Drop variables

In [192]:
new_variables_df['original_names'].tolist()

[['AddressChange', 'AddressChangeInd'],
 ['AmendedReturn', 'AmendedReturnInd'],
 ['BuildTS'],
 ['CountryLegalDomicile', 'LegalDomicileCountryCd'],
 ['Organization4947a1NotPFInd', 'Organization4947a1'],
 ['Organization501cInd', 'Organization501c'],
 ['Organization501c3', 'Organization501c3Ind'],
 ['Filer'],
 ['FinalReturnInd', 'TerminatedReturn'],
 ['GroupExemptionNum', 'GroupExemptionNumber'],
 ['GrossReceiptsAmt', 'GrossReceipts'],
 ['GroupReturnForAffiliates', 'GroupReturnForAffiliatesInd'],
 ['AllAffiliatesIncluded', 'AllAffiliatesIncludedInd'],
 ['InitialReturn', 'InitialReturnInd'],
 ['PrincipalOfficerNm', 'NameOfPrincipalOfficerPerson'],
 ['BusinessOfficerGrp', 'Officer'],
 ['SpecialConditionDesc', 'SpecialConditionDescription'],
 ['LegalDomicileStateCd', 'StateLegalDomicile'],
 ['TaxPeriodEndDate', 'TaxPeriodEndDt'],
 ['TaxYr', 'TaxYear'],
 ['ReturnTs', 'Timestamp'],
 ['TypeOfOrganizationAssocInd', 'TypeOfOrganizationAssociation'],
 ['TypeOfOrganizationCorpInd', 'TypeOfOrganizat

In [193]:
new_variables_df[new_variables_df['len']!=2]['original_names'].tolist()

[['BuildTS'], ['Filer'], ['FeesForServicesProfFundraising'], ['TaxPeriod']]

In [194]:
new_variables_df[new_variables_df['len']==2]['original_names'].tolist()

[['AddressChange', 'AddressChangeInd'],
 ['AmendedReturn', 'AmendedReturnInd'],
 ['CountryLegalDomicile', 'LegalDomicileCountryCd'],
 ['Organization4947a1NotPFInd', 'Organization4947a1'],
 ['Organization501cInd', 'Organization501c'],
 ['Organization501c3', 'Organization501c3Ind'],
 ['FinalReturnInd', 'TerminatedReturn'],
 ['GroupExemptionNum', 'GroupExemptionNumber'],
 ['GrossReceiptsAmt', 'GrossReceipts'],
 ['GroupReturnForAffiliates', 'GroupReturnForAffiliatesInd'],
 ['AllAffiliatesIncluded', 'AllAffiliatesIncludedInd'],
 ['InitialReturn', 'InitialReturnInd'],
 ['PrincipalOfficerNm', 'NameOfPrincipalOfficerPerson'],
 ['BusinessOfficerGrp', 'Officer'],
 ['SpecialConditionDesc', 'SpecialConditionDescription'],
 ['LegalDomicileStateCd', 'StateLegalDomicile'],
 ['TaxPeriodEndDate', 'TaxPeriodEndDt'],
 ['TaxYr', 'TaxYear'],
 ['ReturnTs', 'Timestamp'],
 ['TypeOfOrganizationAssocInd', 'TypeOfOrganizationAssociation'],
 ['TypeOfOrganizationCorpInd', 'TypeOfOrganizationCorporation'],
 ['TypeO

In [195]:
flat_list = [item for sublist in new_variables_df[new_variables_df['len']==2]['original_names'].tolist() for item in sublist]
print(len(flat_list))
print(flat_list[:])

378
['AddressChange', 'AddressChangeInd', 'AmendedReturn', 'AmendedReturnInd', 'CountryLegalDomicile', 'LegalDomicileCountryCd', 'Organization4947a1NotPFInd', 'Organization4947a1', 'Organization501cInd', 'Organization501c', 'Organization501c3', 'Organization501c3Ind', 'FinalReturnInd', 'TerminatedReturn', 'GroupExemptionNum', 'GroupExemptionNumber', 'GrossReceiptsAmt', 'GrossReceipts', 'GroupReturnForAffiliates', 'GroupReturnForAffiliatesInd', 'AllAffiliatesIncluded', 'AllAffiliatesIncludedInd', 'InitialReturn', 'InitialReturnInd', 'PrincipalOfficerNm', 'NameOfPrincipalOfficerPerson', 'BusinessOfficerGrp', 'Officer', 'SpecialConditionDesc', 'SpecialConditionDescription', 'LegalDomicileStateCd', 'StateLegalDomicile', 'TaxPeriodEndDate', 'TaxPeriodEndDt', 'TaxYr', 'TaxYear', 'ReturnTs', 'Timestamp', 'TypeOfOrganizationAssocInd', 'TypeOfOrganizationAssociation', 'TypeOfOrganizationCorpInd', 'TypeOfOrganizationCorporation', 'TypeOfOrganizationOther', 'TypeOfOrganizationOtherInd', 'TypeOfOr

<br> Flatten a list of lists: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
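
An equivalent way to flatten the nested list uses `itertools.chain.from_iterable` from the standard library. A short sketch on two of the pairs listed above (the variable names `pairs` and `flat` are illustrative):

```python
from itertools import chain

# Two of the [old_name_1, old_name_2] pairs from original_names.
pairs = [
    ['AddressChange', 'AddressChangeInd'],
    ['AmendedReturn', 'AmendedReturnInd'],
]

# chain.from_iterable yields the items of each inner list in order.
flat = list(chain.from_iterable(pairs))
print(len(flat))
print(flat)
```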

In [196]:
print(len([c for c in df.columns.tolist() if c not in flat_list]))
print([c for c in df.columns.tolist() if c not in flat_list])

199
['OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'F9_09_PC_FEES_FOR_SVCE_FR_TOT', 'fiscal_year', 'Filer', 'F9_00_HD_BUILD_TIME_STAMP', 'F9_00_HD_ADDR_CHANGE', 'F9_00_HD_AMENDED_RETURN', 'F9_00_HD_CTRY_OF_DOMICILE', 'F9_00_HD_EXEMPT_STATUS_4847A1', 'F9_00_HD_EXEMPT_STATUS_501C', 'F9_00_HD_EXEMPT_STATUS_501C3', 'F9_00_HD_FINAL_RETURN', 'F9_00_HD_GROSS_EXEMPT_NUM', 'F9_00_HD_GROSS_RCPT', 'F9_00_HD_GROUP_RETURN', 'F9_00_HD_INCLUDES_SUBORD_ORGS', 'F9_00_HD_INITIAL_RETURN', 'F9_00_HD_PRIN_OFF_NAME', 'F9_00_HD_SIGNING_OFFICER_SIGNTR', 'F9_00_HD_SPECIAL_CONDITION_DESC', 'F9_00_HD_STATE_OF_DOMICILE', 'F9_00_HD_TAX_PER_END', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_TIME_STAMP', 'F9_00_HD_TYPE_ORG_ASSOCIATION', 'F9_00_HD_TYPE_ORG_CORP', 'F9_00_HD_TYPE_ORG_OTHER', 'F9_00_HD_TYPE_ORG_OTHER_DESC', 'F9_00_HD_TYPE_ORG_TRUST', 'F9_00_HD_WEBSITE', 'F9_00_HD_YEAR_FORMED', 'F9_01_PC_BEN_PAID_MEMB_PRIOR', 'F9_01_PC_CONTR_GRANTS_CURR', 'F9_01_PC_CONTR_GRANTS_PRIOR', 'F9_01_PC_GRANTS_PRIOR', 'F9_01_PC_INDEP_

In [197]:
print(len(new_variables_df['variable_name_new'].tolist()))

193


In [199]:
set([c for c in df.columns.tolist() if c not in flat_list]) - set(new_variables_df['variable_name_new'].tolist())

{'501c3', 'DLN', 'EIN', 'Filer', 'OrganizationName', 'URL', 'fiscal_year'}
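
Once the leftover columns look right, the original source columns can be removed. One option is `DataFrame.drop` with `columns=`, sketched here on an invented frame (`flat_list_demo` and the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'EIN': ['000000001', '000000002'],
    'AddressChange': ['X', np.nan],
    'AddressChangeInd': [np.nan, 'X'],
    'F9_00_HD_ADDR_CHANGE': [1, 1],
})
flat_list_demo = ['AddressChange', 'AddressChangeInd']

# errors='ignore' skips any listed name that is not actually a column.
trimmed = toy.drop(columns=flat_list_demo, errors='ignore')
print(len(toy.columns), '->', len(trimmed.columns))
```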

<br>The following block drops 324 columns

In [200]:
print(len(df.columns))
df = df[[c for c in df.columns.tolist() if c not in flat_list]]
print(len(df.columns))
df[:2]

523
199


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1.0,,,,,1,,,1473903,0,,,MICHAEL ANTON,"{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0.0,1439340,1044925.0,638637.0,10,30447,1753405,243131,0.0,0,0.0,0,89152,193604,,2440859,881768,195892,0,0.0,450430,1075372,0,0.0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1.0,0,0,0,0,0.0,0,1439340.0,,,,,,,,,,,,,,,1439340,1000,,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}",,,,,,,,,"{'Total': '21675', 'ManagementAndGeneral': '21675'}",,"{'Total': '215', 'ManagementAndGeneral': '215'}",,,,,,,,,,,,,,,,,,,,"{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}",,,,256845,86228,,1,,"{'BOY': '51640', 'EOY': '240077'}",,"{'BOY': '332660', 'EOY': '270700'}","{'BOY': '332660', 'EOY': '270700'}",,,,"{'BOY': '1925215', 'EOY': 
'2440859'}",,,,89152,,0,1,,,1,,0,1,1
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,{'Total': '0'},2011,"{'EIN': '581805618', 'Name': {'BusinessNameLine1': 'TORRINGTON VOA ELDERLY HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}, 'NameControl': 'TORR', 'Phone': '7033415000', 'USAddress': {'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA'...",2016-02-24 21:20:13Z,,,,,,1,,1736.0,266420,0,0.0,,,"{'Name': 'THOMAS D TURNBULL', 'Title': 'ASST. SEC/TREAS', 'DateSigned': '2011-11-09'}",,WY,2011-06-30,2010,2011-11-09T07:32:06-08:00,,1,,,,,1993,,0,,,13,1425,1437850,189785,,0,,222839,-39085,-36926,,1433342,261190,0,0,,34577,224264,0,,19,0,0,828,1398765,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,222550,0,265592,82955,71405,1455332,305505,17482,266420,0,0,0,1,1,1,0,1,1,1,1,0,1,1,,1,0,0.0,0,0,0,1,1,1,13,19,0,1,,,0.0,,0,0,1,,0,0,1,411648,,1180355,,,,,,,,,,,,,,265592.0,,0,0,265592.0,"{'TotalRevenueColumn': '266420', 'RelatedOrExemptFunctionIncome': '266420'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '7500', 'ManagementAndGeneral': '7500'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '21600', 'ManagementAndGeneral': '21600'}",{'Total': '0'},"{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '17714', 'ProgramServices': '17714'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '59440', 'ProgramServices': '59440'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '5801', 'ProgramServices': '5801'}","{'Total': '5801', 'ProgramServices': 
'5801'}",{'Total': '0'},{'Total': '0'},{'Total': '0'},{'Total': '0'},"{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}","{'Total': '305505', 'ProgramServices': '276405', 'ManagementAndGeneral': '29100', 'Fundraising': '0'}",,"{'BOY': '250', 'EOY': '22261'}","{'BOY': '250', 'EOY': '22261'}",2187206,904332,,1,,"{'BOY': '9203', 'EOY': '11349'}",,{'EOY': '0'},{'EOY': '0'},"{'BOY': '6219', 'EOY': '7035'}",,,"{'BOY': '1455332', 'EOY': '1433342'}",,,,-39085,,0,1,,,1,1.0,1,1,1


##### Verify

In [214]:
# Number of columns in the renamed DataFrame
print(len(df.columns))

199


In [215]:
# Columns in df that do not appear among the concordance's new variable names
set(df.columns.tolist()) - set(new_variables_df['variable_name_new'].tolist())

{'501c3', 'DLN', 'EIN', 'Filer', 'OrganizationName', 'URL', 'fiscal_year'}

In [216]:
# Concordance variable names that are missing from df
set(new_variables_df['variable_name_new'].tolist()) - set(df.columns.tolist())

{'F9_00_HD_FILER_STATE_US'}
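
The two set differences above can be wrapped in one helper so both directions are checked together. This is only a minimal sketch on toy data: `compare_names`, `toy_columns`, and `toy_concordance_names` are hypothetical names standing in for `df.columns` and `new_variables_df['variable_name_new']`, and are not part of the notebook.

```python
def compare_names(actual, expected):
    """Return (extra, missing): names only in `actual`, names only in `expected`."""
    actual, expected = set(actual), set(expected)
    return actual - expected, expected - actual

# Toy stand-ins (assumptions) for df.columns and the concordance's variable_name_new
toy_columns = ['EIN', 'URL', 'F9_00_HD_TAX_YEAR', 'F9_00_HD_WEBSITE']
toy_concordance_names = ['F9_00_HD_TAX_YEAR', 'F9_00_HD_WEBSITE', 'F9_00_HD_FILER_STATE_US']

extra, missing = compare_names(toy_columns, toy_concordance_names)
print('In DataFrame only:', sorted(extra))
print('In concordance only:', sorted(missing))
```

In the notebook itself, the same call would presumably take `df.columns` and `new_variables_df['variable_name_new']` as its two arguments.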

##### Save DF

In [217]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables (renamed).pkl.gz', compression='gzip')

Wall time: 34min 19s
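
Since the save above took about 34 minutes, it may be worth confirming on a small frame first that a gzip-compressed pickle reads back unchanged. The following is a rough sketch under stated assumptions: `toy_df` and the temporary file name are hypothetical stand-ins, not the notebook's actual data or path.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in (assumption) for the renamed control-variable DataFrame
toy_df = pd.DataFrame({
    'EIN': ['232705170', '581805618'],
    'F9_00_HD_TAX_YEAR': [2010, 2010],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl.gz')
    # Same call pattern as the save cell: pickle with gzip compression
    toy_df.to_pickle(path, compression='gzip')
    # read_pickle infers gzip from the .gz extension by default
    restored = pd.read_pickle(path)

# True when values, dtypes, and column labels all survive the round trip
print(restored.equals(toy_df))
```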


In [219]:
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:1]

# of columns: 199
# of observations: 1895016


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,F9_09_PC_FEES_FOR_SVCE_FR_TOT,fiscal_year,Filer,F9_00_HD_BUILD_TIME_STAMP,F9_00_HD_ADDR_CHANGE,F9_00_HD_AMENDED_RETURN,F9_00_HD_CTRY_OF_DOMICILE,F9_00_HD_EXEMPT_STATUS_4847A1,F9_00_HD_EXEMPT_STATUS_501C,F9_00_HD_EXEMPT_STATUS_501C3,F9_00_HD_FINAL_RETURN,F9_00_HD_GROSS_EXEMPT_NUM,F9_00_HD_GROSS_RCPT,F9_00_HD_GROUP_RETURN,F9_00_HD_INCLUDES_SUBORD_ORGS,F9_00_HD_INITIAL_RETURN,F9_00_HD_PRIN_OFF_NAME,F9_00_HD_SIGNING_OFFICER_SIGNTR,F9_00_HD_SPECIAL_CONDITION_DESC,F9_00_HD_STATE_OF_DOMICILE,F9_00_HD_TAX_PER_END,F9_00_HD_TAX_YEAR,F9_00_HD_TIME_STAMP,F9_00_HD_TYPE_ORG_ASSOCIATION,F9_00_HD_TYPE_ORG_CORP,F9_00_HD_TYPE_ORG_OTHER,F9_00_HD_TYPE_ORG_OTHER_DESC,F9_00_HD_TYPE_ORG_TRUST,F9_00_HD_WEBSITE,F9_00_HD_YEAR_FORMED,F9_01_PC_BEN_PAID_MEMB_PRIOR,F9_01_PC_CONTR_GRANTS_CURR,F9_01_PC_CONTR_GRANTS_PRIOR,F9_01_PC_GRANTS_PRIOR,F9_01_PC_INDEP_VOTING_MEMB,F9_01_PC_INVEST_INCOME_PRIOR,F9_01_PC_NET_ASSETS_BOY,F9_01_PC_OTHER_EXPENSE_PRIOR,F9_01_PC_OTHER_REV_PRIOR,F9_01_PC_PROF_FUNDRISING_EXP_CURR,F9_01_PC_PROF_FUNDRISING_EXP_PRIOR,F9_01_PC_PROG_SERVICE_REV_PRIOR,F9_01_PC_REV_LESS_EXP_CURR,F9_01_PC_REV_LESS_EXP_PRIOR,F9_01_PC_TERMINATION_CONTRACTION,F9_01_PC_TOT_ASSETS_EOY,F9_01_PC_TOT_EXP_PRIOR,F9_01_PC_TOT_FNDR_EXP_CURR,F9_01_PC_TOT_INDIV_EMPLOYED,F9_01_PC_TOT_INDIV_VOLUNTEERS,F9_01_PC_TOT_LIABILITIES_EOY,F9_01_PC_TOT_REVENUE_PRIOR,F9_01_PC_TOT_UBI_GROSS,F9_01_PC_TOT_UBI_NET,F9_01_PC_VOTING_MEMB_GOV_BODY,F9_01_PZ_BEN_PAID_TO_MEMB_CURR,F9_01_PZ_GRANTS_PAID_CURR,F9_01_PZ_INVEST_INCOME_CURR,F9_01_PZ_NAFB_EOY,F9_01_PZ_ORGANIZATIONAL_MISSION,F9_01_PZ_OTHER_EXPENSE_CURR,F9_01_PZ_OTHER_REV_CURR,F9_01_PZ_PROG_SERVICE_REV_CURR,F9_01_PZ_SALARIES_CURR,F9_01_PZ_SALARIES_PRIOR,F9_01_PZ_TOT_ASSETS_BOY,F9_01_PZ_TOT_EXP_CURR,F9_01_PZ_TOT_LIAB_BOY,F9_01_PZ_TOT_REV_CURR,F9_04_PC_FR_EVENT_INC_GT_15K,F9_04_PC_GAMING_INC_GT_15K,F9_04_PC_PROF_FR_EXP_GT_15K,F9_06_PC_990_PROVIDED_GOV_BODY,F9_06_PC_ANNUAL_DISC_COVRD_PERS,F9_06_PC_CEO_COMPENSTN_PROCESS,F9_06_PC_
CHANGES_ORGANIZING_DOCS,F9_06_PC_CONFLICT_OF_INTEREST,F9_06_PC_DECISIONS_SUBJ_APPROVAL,F9_06_PC_DELEGATION_MGT_DUTIES,F9_06_PC_DELEGATION_OF_MGT,F9_06_PC_DOCUMENT_RET_POLICY,F9_06_PC_ELECTION_BOARD_MEMBERS,F9_06_PC_FAMILY_OR_BUSINESS_REL,F9_06_PC_FORM_AVAIL_OWN_WEBSITE,F9_06_PC_FORM_UPON_REQUEST,F9_06_PC_JOINT_VENTURE_INVESTMNT,F9_06_PC_JOINT_VENTURE_POLICY,F9_06_PC_LOCAL_CHAPTERS,F9_06_PC_MATERIAL_DIVERSION,F9_06_PC_MEMBERS_OR_STOCKHOLDERS,F9_06_PC_MINUTES_COMMITTEES,F9_06_PC_MINUTES_GOVERNING_BODY,F9_06_PC_MONITORING_OF_COI_POLICY,F9_06_PC_NUM_IND_VOTING_MEMBERS,F9_06_PC_NUM_VOTING_GOV_MEMBERS,F9_06_PC_OFFICER_MAILING_ADDRESS,F9_06_PC_OTHER_COMPENSTN_PROCESS,F9_06_PC_OTHER_WEBSITE,F9_06_PC_OWN_WEBSITE,F9_06_PC_POLICIES_GOVERN_CHAPTER,F9_06_PC_STATES_WHERE_RET_FILED,F9_06_PC_WHISTLEBLOWER_POLICY,F9_07_PC_COMPENSATION_OTHER_SRCE,F9_07_PC_FORMER_OFFICER_LISTED,F9_07_PC_NO_LISTED_PERS_COMPENSD,F9_07_PC_NUM_CONTRCTRS_GRTR_100K,F9_07_PC_NUM_INDS_GREATER_100K,F9_07_PC_TOTAL_COMP_GRTR_150K,F9_07_PC_TOT_OTHER_COMPENSATION,F9_07_PC_TOT_REPRT_COMP_FROM_ORG,F9_07_PC_TOT_REPRT_COMP_RLTD_ORG,F9_08_PC_ALL_OTHER_CONTRIBUTIONS,F9_08_PC_CONTS_REPRTD_FNDRAISNG,F9_08_PC_COST_OF_GOODS_SOLD,F9_08_PC_FEDERATED_CAMPAIGNS,F9_08_PC_FUNDRAISING_DIRECT_EXP,F9_08_PC_FUNDRAISING_EVENTS,F9_08_PC_FUNDRAISING_GROSS_INC,F9_08_PC_GAMING_DIRECT_EXPENSES,F9_08_PC_GAMING_GROSS_INCOME,F9_08_PC_GOVERNMENT_GRANTS,F9_08_PC_GROSS_SALES_INVENTORY,F9_08_PC_MEMBERSHIP_DUES,F9_08_PC_NONCASH_CONTRIBUTIONS,F9_08_PC_PROGRAM_SVCE_REV_TOTAL,F9_08_PC_RELATED_ORGANIZATIONS,F9_08_PC_TOTAL_CONTRIBUTIONS,F9_08_PC_TOTAL_OTHER_REVENUE,F9_08_PC_TOTAL_PROG_SVCE_REVENUE,F9_08_PC_TOTAL_REVENUE,F9_09_PC_COMP_DISQUAL_FUNDRAISE,F9_09_PC_COMP_DISQUAL_MGMT,F9_09_PC_COMP_DISQUAL_PROG_SVCE,F9_09_PC_COMP_DISQUAL_TOTAL,F9_09_PC_COMP_OFFICERS_FUNDRAISE,F9_09_PC_COMP_OFFICERS_MGMT,F9_09_PC_COMP_OFFICERS_PROG_SVCE,F9_09_PC_COMP_OFFICERS_TOTAL,F9_09_PC_FEES_FOR_SVCE_ACCT_TOT,F9_09_PC_FEES_FOR_SVCE_INVST_TOT,F9_09_PC_FEES_FOR_SVCE_LEGL_TOT
,F9_09_PC_FEES_FOR_SVCE_LOBB_TOT,F9_09_PC_FEES_FOR_SVCE_MGMT_TOT,F9_09_PC_FEES_FOR_SVCE_OTH_TOT,F9_09_PC_OTHER_EMP_BEN_FUNDRAISE,F9_09_PC_OTHER_EMP_BEN_MGMT,F9_09_PC_OTHER_EMP_BEN_PROG_SVCE,F9_09_PC_OTHER_EMP_BEN_TOTAL,F9_09_PC_OTHER_SALARY_FUNDRAISE,F9_09_PC_OTHER_SALARY_MGMT,F9_09_PC_OTHER_SALARY_PROG_SVCE,F9_09_PC_OTHER_SALARY_TOTAL,F9_09_PC_PAYROLL_TAX_FUNDRAISE,F9_09_PC_PAYROLL_TAX_MGMT,F9_09_PC_PAYROLL_TAX_PROG_SVCE,F9_09_PC_PAYROLL_TAX_TOTAL,F9_09_PC_PENSION_CONT_FUNDRAISE,F9_09_PC_PENSION_CONT_MGMT,F9_09_PC_PENSION_CONT_PROG_SVCE,F9_09_PC_PENSION_CONT_TOTAL,F9_09_PC_TOTAL_FUNC_EXPENSES,F9_09_PC_TOTAL_FUNDRAISE_EXPENSE,F9_09_PC_TOTAL_MGMT_EXPENSE,F9_09_PC_TOTAL_PROG_SVCE_EXPENSE,F9_10_PC_BOND_LIABILITIES_EOY,F9_10_PC_CASH_NON_INTEREST_BOY,F9_10_PC_CASH_NON_INTEREST_EOY,F9_10_PC_LAND_BLDG_EQPMT,F9_10_PC_LAND_BLDG_EQPMT_DEPRCTN,F9_10_PC_LOANS_FROM_OFFICERS_EOY,F9_10_PC_ORG_FOLLOWS_SFAS117,F9_10_PC_ORG_NOT_FOLLOW_SFAS117,F9_10_PC_OTHER_LIABILITIES_EOY,F9_10_PC_RET_EARNINGS_ENDWMT_EOY,F9_10_PC_SAVINGS_TEMP_INVEST_BOY,F9_10_PC_SAVINGS_TEMP_INVEST_EOY,F9_10_PC_SECURED_MORTGAGES_EOY,F9_10_PC_UNSECURED_NOTES_BOY,F9_10_PC_UNSECURED_NOTES_EOY,F9_10_PZ_TOTAL_ASSETS_EOY,F9_11_PC_RECNCLTN_DONATED_SVCES,F9_11_PC_RECNCLTN_INVSTMNT_EXP,F9_11_PC_RECNCLTN_PRIOR_PER_ADJ,F9_11_PC_RECNCLTN_REV_LESS_EXP,F9_11_PC_RECNCLTN_UNRLZD_GAIN,F9_12_PC_ACCNT_COMPILE_OR_REVIEW,F9_12_PC_ACCTG_METHOD_ACCRUAL,F9_12_PC_ACCTG_METHOD_CASH,F9_12_PC_ACCTG_METHOD_OTHER,F9_12_PC_AUDIT_COMMITTEE,F9_12_PC_FED_GRNT_AUDIT_PERFORMD,F9_12_PC_FED_GRNT_AUDIT_REQUIRED,F9_12_PC_FINCL_STMTS_AUDITED,501c3
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,,2010,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...",2016-02-24 21:20:13Z,1,,,,,1,,,1473903,0,,,MICHAEL ANTON,"{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",,PA,2010-12-31,2010,2011-11-09T06:41:09-06:00,,1,,,,,1992,0,1439340,1044925,638637,10,30447,1753405,243131,0,0,0,0,89152,193604,,2440859,881768,195892,0,0,450430,1075372,0,0,10,0,925000,33563,1990429,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,459751,1000,0,0,0,1925215,1384751,171810,1473903,0,0,0,1,1,0,0,1,0,0,0,0,0,0,,1,0,,0,0,0,1,1,1,10,10,0,0,,,,"[PA, NJ, DE]",0,0,0,1,0,0,0,0,0,0,1439340,,,,,,,,,,,,,,,1439340,1000,,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}",,,,,,,,,"{'Total': '21675', 'ManagementAndGeneral': '21675'}",,"{'Total': '215', 'ManagementAndGeneral': '215'}",,,,,,,,,,,,,,,,,,,,"{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}",,,,256845,86228,,1,,"{'BOY': '51640', 'EOY': '240077'}",,"{'BOY': '332660', 'EOY': '270700'}","{'BOY': '332660', 'EOY': '270700'}",,,,"{'BOY': '1925215', 'EOY': '2440859'}",,,,89152,,0,1,,,1,,0,1,1
