# Overview

This is the third in a series of tutorials that illustrate how to download, extract, and parse the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/

In the previous notebook we downloaded all 990 filings into a MongoDB database. The goal of this notebook is to extract the JSON data into a Python PANDAS dataset, which will be our dataset of choice for all future analyses. 

The 990 e-file data contains myriad variables, each of which has to be verified before extracting and analyzing. Working with Jesse Lecy at Arizona State and others, a group of us has come up with a "concordance" file containing the *xpath* of all verified variables. Among other things, this concordance file maps the specific lines from the Form 990 to the xpaths in the XML file. Accordingly, our first step will be to read in the concordance file that has **_all_** reconciled and verified variables to date:
- The file is called *concordance_VERIFIED.xlsx*

I then connect with the *MongoDB* database and import all verified variables into a PANDAS dataframe. 

I then also 'flatten' the *ReturnHeader* column (and delete unneeded *ReturnHeader* columns).

Lastly, I save the following file:
- *all filings August 2022 - all control variables.pkl.gz*

In following notebooks I will combine and rename columns, binarize variables, etc. 

# Load Packages and Connect to MongoDB

First, we will add a datestamp and then import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations. It is invaluable for analyzing datasets. 

In [30]:
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

Current date and time :  2022-08-02 13:26:28


In [2]:
import sys
import time
import json

In [3]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
We can check which version of various packages we're using. You can see I'm running PANDAS 1.4.1 here.

In [4]:
print(pd.__version__)

1.4.1


<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [5]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [6]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


#### MongoDB
Depending on the project, I will store the data in SQLite or MongoDB. For this project I'm using MongoDB -- it's great for storing JSON data where each observation could have different variables. Before we get to the interesting part, the following code blocks set up the MongoDB environment and the database we'll be using. 

**_Note:_** In a terminal you'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB.

In [7]:
import pymongo
from pymongo import MongoClient
client = MongoClient()

In [8]:
print(pymongo.__version__)

4.0.2


<br>Get a list of all databases

In [9]:
client.list_database_names()

['ICIJ',
 'OWS',
 'SMC',
 'admin',
 'cashtags',
 'config',
 'irs_990_db',
 'irs_990_db_v2',
 'local',
 'paradisepapers',
 'sec',
 'sp1500',
 'sp500']

##### Let's define the database and collection/table we created in the previous notebook for storing the 990 filings.

In [10]:
# DEFINE MY mongoDB DATABASE
db = client['irs_990_db']

# DEFINE MY COLLECTION HOUSING 990 DATA
filings_990 = db['filings_990']

<br>When we set up our database in an earlier tutorial, we set a unique constraint on the collection based on *URL*. This prevents duplicate filings from being inserted. Uncomment the following line if the index has not yet been created.

In [15]:
#db.filings_990.create_index([('URL', pymongo.ASCENDING)], unique=True)

<br>Show the index. We can see that as expected *URL* is an index item.

In [13]:
list(db.filings_990.index_information())

['_id_', 'URL_1']

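<br>Check how many observations are in the collection. Note that ``estimated_document_count`` reads the count from the collection's metadata rather than scanning every document.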

In [85]:
filings_990.estimated_document_count()

2192435

<br>Show one filing in the database. You can see here the data is in JSON format. In this notebook we will be converting these filings to a typical 'flat' (one variable per column) database.

In [15]:
db.filings_990.find_one({'URL' : "https://s3.amazonaws.com/irs-form-990/201100289349300910_public.xml" })

{'_id': ObjectId('5d01cfed78ffca27b428aa97'),
 'OrganizationName': 'ASSEMBLEIA DE DEUS MINISTERIO BELEM CHUR',
 'ObjectId': '201100289349300910',
 'URL': 'https://s3.amazonaws.com/irs-form-990/201100289349300910_public.xml',
 'SubmittedOn': '2011-09-22',
 'DLN': '93493028009101',
 'LastUpdated': '2016-03-21T17:23:53',
 'TaxPeriod': '201012',
 'FormType': '990',
 'EIN': '954745380',
 '@xmlns': 'http://www.irs.gov/efile',
 '@returnVersion': '2010v3.2',
 'ReturnHeader': {'@binaryAttachmentCount': '0',
  'Timestamp': '2011-01-28T13:07:07-08:00',
  'TaxPeriodEndDate': '2010-12-31',
  'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'VIRULAS GENERAL OFFICE'},
   'PreparerFirmUSAddress': {'AddressLine1': '4138 ATLANTIC AVE',
    'City': 'Long Beach',
    'State': 'CA',
    'ZIPCode': '90807'}},
  'ReturnType': '990',
  'TaxPeriodBeginDate': '2010-01-01',
  'Filer': {'EIN': '954745380',
   'Name': {'BusinessNameLine1': 'ASSEMBLEIA DE DEUS MINISTERIO BELEM CHUR'},
   'NameCont
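
To make the idea of 'flattening' concrete, here is a minimal sketch of a recursive helper that turns nested keys into dotted column names. The function name `flatten` and the dotted-name convention are illustrative assumptions, not necessarily the approach used later in this series; the sample document is a small hand-typed subset of the filing shown above.

```python
def flatten(doc, parent_key="", sep="."):
    """Return a one-level dict whose keys are dotted paths into `doc`."""
    flat = {}
    for key, value in doc.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the path so far
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# A hand-typed subset of the filing shown above
filing = {
    "EIN": "954745380",
    "ReturnHeader": {
        "TaxPeriodEndDate": "2010-12-31",
        "ReturnType": "990",
        "Filer": {"EIN": "954745380"},
    },
}

flat_filing = flatten(filing)
print(sorted(flat_filing))
```

Each nested value ends up under a single key such as `ReturnHeader.Filer.EIN`, which is the shape a one-variable-per-column table needs.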

<br>We can also just show the 'keys', or variable names, for one filing. You can see the huge number of variables available.

In [24]:
for f in filings_990.find({})[:1]:
    print(sorted(f.keys()))

['@documentCount', '@documentId', '@referenceDocumentId', '@returnVersion', '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', 'AccountantCompileOrReview', 'AccountsPayableAccruedExpenses', 'AccountsReceivable', 'ActivitiesConductedPartnership', 'ActivityOrMissionDescription', 'AddressChange', 'AddressPrincipalOfficerUS', 'AllOtherContributions', 'AllOtherExpenses', 'AnnualDisclosureCoveredPersons', 'AuditCommittee', 'BenefitsPaidToMembersCY', 'BenefitsPaidToMembersPriorYear', 'BsnssRltnshpThruFamilyMember', 'BsnssRltnshpWithOrganization', 'ChangesToOrganizingDocs', 'CollectionsOfArt', 'CompensationFromOtherSources', 'CompensationProcessCEO', 'CompensationProcessOther', 'ComplianceWithBackupWitholding', 'ConflictOfInterestPolicy', 'ConservationEasements', 'ConsolidatedAuditFinancialStmt', 'ContributionsGrantsCurrentYear', 'ContributionsGrantsPriorYear', 'CreditCounseling', 'DLN', 'DecisionsSubjectToApproval', 'DeductibleContributionsOfArt', 'DeductibleNonCashContributions', 'DelegationOfMa
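
Because filings differ in which keys they carry, the keys of a single document understate what is available. A minimal sketch of tallying keys across several documents follows. The three small dictionaries are made-up stand-ins for documents that a cursor such as `filings_990.find({}).limit(1000)` would return, and `key_counts` is an illustrative name.

```python
from collections import Counter

# Hypothetical stand-ins for documents returned by a MongoDB cursor
sample_docs = [
    {"EIN": "1", "AddressChange": "X", "TotalAssets": "100"},
    {"EIN": "2", "AddressChangeInd": "X", "TotalAssetsEOYAmt": "250"},
    {"EIN": "3", "AddressChange": "X"},
]

key_counts = Counter()
for doc in sample_docs:
    # Count each key once per document in which it appears
    key_counts.update(doc.keys())

for key, n in key_counts.most_common():
    print(f"{key}: present in {n} of {len(sample_docs)} documents")
```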

# Read in Concordance File
We are going to read in a 'concordance' file. In this notebook we are interested in the *xpaths* for these variables -- in general, each 990 variable will have two different *xpaths* that vary according to year. These *xpaths* allow us to identify the location of the variables in each filing. In a following notebook, we will be using the *variable_name_new* field as our variable name. There are other relevant columns in the concordance file, which we'll cover in subsequent notebooks. 

In [16]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 17
# of observations: 574


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key,cardinality
0,/Return/ReturnData/IRS990/SpecialConditionDesc,F9_00_HD_SPECIAL_CONDITION_DESC,,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDesc,,,
1,/Return/ReturnData/IRS990/SpecialConditionDescription,F9_00_HD_SPECIAL_CONDITION_DESC,31.0,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDescription,,,
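
The two rows displayed above illustrate the pattern: one *variable_name_new* value, two *xpaths*. As a rough sketch of how that many-to-one relationship can be inspected, the snippet below groups xpaths by variable name using only those two rows, typed in by hand. With the full file one would loop over the rows of `concordance` instead; `xpaths_by_variable` is an illustrative name.

```python
from collections import defaultdict

# The two concordance rows shown above, reduced to the relevant columns
rows = [
    {"xpath": "/Return/ReturnData/IRS990/SpecialConditionDesc",
     "variable_name_new": "F9_00_HD_SPECIAL_CONDITION_DESC"},
    {"xpath": "/Return/ReturnData/IRS990/SpecialConditionDescription",
     "variable_name_new": "F9_00_HD_SPECIAL_CONDITION_DESC"},
]

xpaths_by_variable = defaultdict(list)
for row in rows:
    xpaths_by_variable[row["variable_name_new"]].append(row["xpath"])

for variable, xpaths in xpaths_by_variable.items():
    print(variable, "->", len(xpaths), "xpaths")
```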


<br>Check *MongoDB_Name*. This name was taken from the *xpath* column.

In [17]:
print(len(concordance['MongoDB_Name'].tolist()))
print(len(set(concordance['MongoDB_Name'].tolist())))

574
480
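
In the two concordance rows shown earlier, *MongoDB_Name* matches the final segment of the *xpath*. Assuming that holds for every row, which two rows alone cannot confirm, the name can be derived as sketched here. The sketch also shows one way repeated names can arise: two distinct xpaths can end in the same segment. The third xpath below is hypothetical, invented only to show such a collision, and `leaf_name` is an illustrative helper.

```python
def leaf_name(xpath):
    """Return the last segment of a slash-delimited xpath."""
    return xpath.rstrip("/").rsplit("/", 1)[-1]


xpaths = [
    "/Return/ReturnData/IRS990/SpecialConditionDesc",
    "/Return/ReturnData/IRS990/SpecialConditionDescription",
    # Hypothetical path, invented to show how two xpaths can share a leaf name
    "/Return/ReturnData/IRS990/SomeGroup/SpecialConditionDesc",
]

names = [leaf_name(x) for x in xpaths]
print(len(names), len(set(names)))
```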


<br>Create a list, ``mongo_cols``, of all the variable names in the concordance file. After removing duplicates there are 480 distinct names. These are our target variables -- the ones we'll extract for each filing from MongoDB.

In [18]:
mongo_cols = concordance['MongoDB_Name'].tolist()
print(len(mongo_cols))
print(len(set(mongo_cols)))
mongo_cols = list(set(mongo_cols))
print(len(mongo_cols))
print(mongo_cols[:5])

574
480
480
['TotalOfOtherProgramServiceExp', 'FederatedCampaigns', 'MortgNotesPyblScrdInvstPropGrp', 'PYTotalRevenueAmt', 'Timestamp']


# Extract Data from MongoDB Database

Next we tidy the list of desired columns. Here we use a 'list comprehension' to remove any ``nan`` values from the list we just created. The count stays at 480, so no ``nan`` values were present. 

In [20]:
mongo_cols = [x for x in mongo_cols if str(x) != 'nan']
print(len(mongo_cols))

480
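
The de-duplication and ``nan`` filtering steps can also be folded into one pass that yields a reproducible order, since the ordering of a `set` is arbitrary. A minimal sketch on a made-up list; `raw_names`, `is_missing`, and `clean_names` are illustrative names.

```python
import math

# Made-up input standing in for concordance['MongoDB_Name'].tolist();
# float('nan') mimics an empty spreadsheet cell
raw_names = ["Timestamp", "FederatedCampaigns", float("nan"), "Timestamp"]


def is_missing(value):
    """True for None and for float NaN, the usual missing-value markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Drop missing values, de-duplicate, and sort for a stable order
clean_names = sorted({name for name in raw_names if not is_missing(name)})
print(clean_names)
```

Checking for a float NaN directly avoids relying on the string comparison `str(x) != 'nan'`, which would also drop a variable that happened to be named `nan`.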


In [21]:
print(len(sorted(mongo_cols)))

480


<br>Use a 'helper' loop to print out the variables in MongoDB projection format -- we'll copy and paste this output into a subsequent block of code.

In [22]:
for c in sorted(mongo_cols):
    print("    '"+c+"'"+': 1, ')

    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'AccountsPayableAccrExpnssGrp': 1, 
    'AccountsPayableAccruedExpenses': 1, 
    'AccountsReceivable': 1, 
    'AccountsReceivableGrp': 1, 
    'ActivitiesConductedPartnership': 1, 
    'ActivitiesConductedPrtshpInd': 1, 
    'Activity2': 1, 
    'Activity3': 1, 
    'ActivityCd': 1, 
    'ActivityCode': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'ActivityOther': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'Advertising': 1, 
    'AdvertisingGrp': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AllOtherExpenses': 1, 
    'AllOtherExpensesGrp': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidTo

<br>The 'helper' loop output we created above is copied and pasted here into a query whose result is stored in a variable called ``cursor``. Note that in the first line we also include five identifier columns (*EIN*, *OrganizationName*, *DLN*, *URL*, and *ReturnHeader*). We also include *_id* (a MongoDB column) with a '0' tag, meaning we don't want this otherwise automatically included column.

In [26]:
cursor = filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,  'ReturnHeader': 1,
    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'AccountsPayableAccrExpnssGrp': 1, 
    'AccountsPayableAccruedExpenses': 1, 
    'AccountsReceivable': 1, 
    'AccountsReceivableGrp': 1, 
    'ActivitiesConductedPartnership': 1, 
    'ActivitiesConductedPrtshpInd': 1, 
    'Activity2': 1, 
    'Activity3': 1, 
    'ActivityCd': 1, 
    'ActivityCode': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'ActivityOther': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'Advertising': 1, 
    'AdvertisingGrp': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AllOtherExpenses': 1, 
    'AllOtherExpensesGrp': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidToMembersCY': 1, 
    'BenefitsPaidToMembersPriorYear': 1, 
    'BenefitsToMembers': 1, 
    'BenefitsToMembersGrp': 1, 
    'BuildTS': 1, 
    'BusinessOfficerGrp': 1, 
    'CYBenefitsPaidToMembersAmt': 1, 
    'CYContributionsGrantsAmt': 1, 
    'CYGrantsAndSimilarPaidAmt': 1, 
    'CYInvestmentIncomeAmt': 1, 
    'CYOtherExpensesAmt': 1, 
    'CYOtherRevenueAmt': 1, 
    'CYProgramServiceRevenueAmt': 1, 
    'CYRevenuesLessExpensesAmt': 1, 
    'CYSalariesCompEmpBnftPaidAmt': 1, 
    'CYTotalExpensesAmt': 1, 
    'CYTotalFundraisingExpenseAmt': 1, 
    'CYTotalProfFndrsngExpnsAmt': 1, 
    'CYTotalRevenueAmt': 1, 
    'CashNonInterestBearing': 1, 
    'CashNonInterestBearingGrp': 1, 
    'ChangeToOrgDocumentsInd': 1, 
    'ChangesToOrganizingDocs': 1, 
    'CntrbtnsRprtdFundraisingEvents': 1, 
    'CntrctRcvdGreaterThan100KCnt': 1, 
    'CompCurrentOfcrDirectorsGrp': 1, 
    'CompCurrentOfficersDirectors': 1, 
    'CompDisqualPersons': 1, 
    'CompDisqualPersonsGrp': 1, 
    'CompensationFromOtherSources': 1, 
    'CompensationFromOtherSrcsInd': 1, 
    'CompensationProcessCEO': 1, 
    'CompensationProcessCEOInd': 1, 
    'CompensationProcessOther': 1, 
    'CompensationProcessOtherInd': 1, 
    'ConferencesMeetings': 1, 
    'ConferencesMeetingsGrp': 1, 
    'ConflictOfInterestPolicy': 1, 
    'ConflictOfInterestPolicyInd': 1, 
    'ContractTerminationInd': 1, 
    'ContriRptFundraisingEventAmt': 1, 
    'ContributionsGrantsCurrentYear': 1, 
    'ContributionsGrantsPriorYear': 1, 
    'CostOfGoodsSold': 1, 
    'CostOfGoodsSoldAmt': 1, 
    'CountryLegalDomicile': 1, 
    'DecisionsSubjectToApprovaInd': 1, 
    'DecisionsSubjectToApproval': 1, 
    'DeferredRevenue': 1, 
    'DeferredRevenueGrp': 1, 
    'DelegationOfManagementDuties': 1, 
    'DelegationOfMgmtDutiesInd': 1, 
    'DepreciationDepletion': 1, 
    'DepreciationDepletionGrp': 1, 
    'Desc': 1, 
    'Description': 1, 
    'DisregardedEntity': 1, 
    'DisregardedEntityInd': 1, 
    'DoNotFollowSFAS117': 1, 
    'DocumentRetentionPolicy': 1, 
    'DocumentRetentionPolicyInd': 1, 
    'DonatedServicesAndUseFcltsAmt': 1, 
    'ElectionOfBoardMembers': 1, 
    'ElectionOfBoardMembersInd': 1, 
    'EmployeeCnt': 1, 
    'EngagedInExcessBenefitTransInd': 1, 
    'EscrowAccountLiability': 1, 
    'EscrowAccountLiabilityGrp': 1, 
    'ExcessBenefitTransaction': 1, 
    'Expense': 1, 
    'ExpenseAmt': 1, 
    'FSAudited': 1, 
    'FSAuditedInd': 1, 
    'FamilyOrBusinessRelationship': 1, 
    'FamilyOrBusinessRlnInd': 1, 
    'FederalGrantAuditPerformed': 1, 
    'FederalGrantAuditPerformedInd': 1, 
    'FederalGrantAuditRequired': 1, 
    'FederalGrantAuditRequiredInd': 1, 
    'FederatedCampaigns': 1, 
    'FederatedCampaignsAmt': 1, 
    'FeesForServicesAccounting': 1, 
    'FeesForServicesAccountingGrp': 1, 
    'FeesForServicesInvstMgmntFees': 1, 
    'FeesForServicesLegal': 1, 
    'FeesForServicesLegalGrp': 1, 
    'FeesForServicesLobbying': 1, 
    'FeesForServicesLobbyingGrp': 1, 
    'FeesForServicesManagement': 1, 
    'FeesForServicesManagementGrp': 1, 
    'FeesForServicesOther': 1, 
    'FeesForServicesOtherGrp': 1, 
    'FeesForServicesProfFundraising': 1, 
    'FeesForSrvcInvstMgmntFeesGrp': 1, 
    'Filer': 1, 
    'FinalReturnInd': 1, 
    'FollowSFAS117': 1, 
    'ForeignGrants': 1, 
    'ForeignGrantsGrp': 1, 
    'Form990ProvidedToGoverningBody': 1, 
    'Form990ProvidedToGvrnBodyInd': 1, 
    'FormationYr': 1, 
    'FormerOfcrEmployeesListedInd': 1, 
    'FormersListed': 1, 
    'FundraisingActivities': 1, 
    'FundraisingActivitiesInd': 1, 
    'FundraisingAmt': 1, 
    'FundraisingDirectExpenses': 1, 
    'FundraisingDirectExpensesAmt': 1, 
    'FundraisingEvents': 1, 
    'FundraisingGrossIncomeAmt': 1, 
    'Gaming': 1, 
    'GamingActivitiesInd': 1, 
    'GamingDirectExpenses': 1, 
    'GamingDirectExpensesAmt': 1, 
    'GamingGrossIncomeAmt': 1, 
    'GoverningBodyVotingMembersCnt': 1, 
    'GovernmentGrants': 1, 
    'GovernmentGrantsAmt': 1, 
    'GrantAmt': 1, 
    'Grants': 1, 
    'GrantsAndSimilarAmntsCY': 1, 
    'GrantsAndSimilarAmntsPriorYear': 1, 
    'GrantsPayable': 1, 
    'GrantsPayableGrp': 1, 
    'GrantsToDomesticIndividuals': 1, 
    'GrantsToDomesticIndividualsGrp': 1, 
    'GrantsToDomesticOrgs': 1, 
    'GrantsToDomesticOrgsGrp': 1, 
    'GrossIncomeFundraisingEvents': 1, 
    'GrossIncomeGaming': 1, 
    'GrossReceipts': 1, 
    'GrossReceiptsAmt': 1, 
    'GrossSalesOfInventory': 1, 
    'GrossSalesOfInventoryAmt': 1, 
    'GroupExemptionNum': 1, 
    'GroupExemptionNumber': 1, 
    'GroupReturnForAffiliates': 1, 
    'GroupReturnForAffiliatesInd': 1, 
    'IRPDocumentCnt': 1, 
    'IndependentVotingMemberCnt': 1, 
    'IndivRcvdGreaterThan100KCnt': 1, 
    'InfoInScheduleOPartIII': 1, 
    'InfoInScheduleOPartIIIInd': 1, 
    'InfoInScheduleOPartIX': 1, 
    'InfoInScheduleOPartIXInd': 1, 
    'InfoInScheduleOPartV': 1, 
    'InfoInScheduleOPartVI': 1, 
    'InfoInScheduleOPartVII': 1, 
    'InfoInScheduleOPartVIII': 1, 
    'InfoInScheduleOPartVIIIInd': 1, 
    'InfoInScheduleOPartVIIInd': 1, 
    'InfoInScheduleOPartVIInd': 1, 
    'InfoInScheduleOPartVInd': 1, 
    'InfoInScheduleOPartX': 1, 
    'InfoInScheduleOPartXI': 1, 
    'InfoInScheduleOPartXII': 1, 
    'InfoInScheduleOPartXIIInd': 1, 
    'InfoInScheduleOPartXIInd': 1, 
    'InfoInScheduleOPartXInd': 1, 
    'InformationTechnology': 1, 
    'InformationTechnologyGrp': 1, 
    'InitialReturn': 1, 
    'InitialReturnInd': 1, 
    'Insurance': 1, 
    'InsuranceGrp': 1, 
    'IntangibleAssets': 1, 
    'IntangibleAssetsGrp': 1, 
    'Interest': 1, 
    'InterestGrp': 1, 
    'InventoriesForSaleOrUse': 1, 
    'InventoriesForSaleOrUseGrp': 1, 
    'InvestmentExpenseAmt': 1, 
    'InvestmentInJointVenture': 1, 
    'InvestmentInJointVentureInd': 1, 
    'InvestmentIncomeCurrentYear': 1, 
    'InvestmentIncomePriorYear': 1, 
    'InvestmentsOtherSecurities': 1, 
    'InvestmentsOtherSecuritiesGrp': 1, 
    'InvestmentsProgramRelated': 1, 
    'InvestmentsProgramRelatedGrp': 1, 
    'InvestmentsPubTradedSecGrp': 1, 
    'InvestmentsPubTradedSecurities': 1, 
    'LandBldgEquipAccumDeprecAmt': 1, 
    'LandBldgEquipBasisNetGrp': 1, 
    'LandBldgEquipCostOrOtherBssAmt': 1, 
    'LandBldgEquipmentAccumDeprec': 1, 
    'LandBuildingsEquipmentBasis': 1, 
    'LandBuildingsEquipmentBasisNet': 1, 
    'LegalDomicileCountryCd': 1, 
    'LegalDomicileStateCd': 1, 
    'LoansFromOfficersDirectors': 1, 
    'LoansFromOfficersDirectorsGrp': 1, 
    'LobbyingActivities': 1, 
    'LobbyingActivitiesInd': 1, 
    'LocalChapters': 1, 
    'LocalChaptersInd': 1, 
    'MaterialDiversionOrMisuse': 1, 
    'MaterialDiversionOrMisuseInd': 1, 
    'MembersOrStockholders': 1, 
    'MembersOrStockholdersInd': 1, 
    'MembershipDues': 1, 
    'MembershipDuesAmt': 1, 
    'MethodOfAccountingAccrual': 1, 
    'MethodOfAccountingAccrualInd': 1, 
    'MethodOfAccountingCash': 1, 
    'MethodOfAccountingCashInd': 1, 
    'MethodOfAccountingOther': 1, 
    'MethodOfAccountingOtherInd': 1, 
    'MinutesOfCommittees': 1, 
    'MinutesOfCommitteesInd': 1, 
    'MinutesOfGoverningBody': 1, 
    'MinutesOfGoverningBodyInd': 1, 
    'MissionDesc': 1, 
    'MissionDescription': 1, 
    'MortNotesPyblSecuredInvestProp': 1, 
    'MortgNotesPyblScrdInvstPropGrp': 1, 
    'NameOfPrincipalOfficerPerson': 1, 
    'NbrIndependentVotingMembers': 1, 
    'NbrVotingGoverningBodyMembers': 1, 
    'NbrVotingMembersGoverningBody': 1, 
    'NetAssetsOrFundBalancesBOY': 1, 
    'NetAssetsOrFundBalancesBOYAmt': 1, 
    'NetAssetsOrFundBalancesEOY': 1, 
    'NetAssetsOrFundBalancesEOYAmt': 1, 
    'NetUnrelatedBusTxblIncmAmt': 1, 
    'NetUnrelatedBusinessTxblIncome': 1, 
    'NetUnrlzdGainsLossesInvstAmt': 1, 
    'NoListedPersonsCompensated': 1, 
    'NoListedPersonsCompensatedInd': 1, 
    'NoncashContributions': 1, 
    'NoncashContributionsAmt': 1, 
    'NumberFormsTransmittedWith1096': 1, 
    'NumberIndependentVotingMembers': 1, 
    'NumberIndividualsGT100K': 1, 
    'NumberOfContractorsGT100K': 1, 
    'NumberOfEmployees': 1, 
    'Occupancy': 1, 
    'OccupancyGrp': 1, 
    'OfficeExpenses': 1, 
    'OfficeExpensesGrp': 1, 
    'Officer': 1, 
    'OfficerMailingAddress': 1, 
    'OfficerMailingAddressInd': 1, 
    'OrgDoesNotFollowSFAS117Ind': 1, 
    'Organization4947a1': 1, 
    'Organization4947a1NotPFInd': 1, 
    'Organization501c': 1, 
    'Organization501c3': 1, 
    'Organization501c3Ind': 1, 
    'Organization501cInd': 1, 
    'OrganizationFollowsSFAS117Ind': 1, 
    'OthNotesLoansReceivableNetGrp': 1, 
    'OtherAssetsTotal': 1, 
    'OtherAssetsTotalGrp': 1, 
    'OtherEmployeeBenefits': 1, 
    'OtherEmployeeBenefitsGrp': 1, 
    'OtherExpensePriorYear': 1, 
    'OtherExpenses': 1, 
    'OtherExpensesCurrentYear': 1, 
    'OtherExpensesGrp': 1, 
    'OtherLiabilities': 1, 
    'OtherLiabilitiesGrp': 1, 
    'OtherNotesLoansReceivableNet': 1, 
    'OtherOrganizationDsc': 1, 
    'OtherRevenueCurrentYear': 1, 
    'OtherRevenuePriorYear': 1, 
    'OtherRevenueTotalAmt': 1, 
    'OtherSalariesAndWages': 1, 
    'OtherSalariesAndWagesGrp': 1, 
    'OtherWebsite': 1, 
    'OtherWebsiteInd': 1, 
    'OwnWebsite': 1, 
    'OwnWebsiteInd': 1, 
    'PYBenefitsPaidToMembersAmt': 1, 
    'PYContributionsGrantsAmt': 1, 
    'PYExcessBenefitTransInd': 1, 
    'PYGrantsAndSimilarPaidAmt': 1, 
    'PYInvestmentIncomeAmt': 1, 
    'PYOtherExpensesAmt': 1, 
    'PYOtherRevenueAmt': 1, 
    'PYProgramServiceRevenueAmt': 1, 
    'PYRevenuesLessExpensesAmt': 1, 
    'PYSalariesCompEmpBnftPaidAmt': 1, 
    'PYTotalExpensesAmt': 1, 
    'PYTotalProfFndrsngExpnsAmt': 1, 
    'PYTotalRevenueAmt': 1, 
    'PaymentsToAffiliates': 1, 
    'PaymentsToAffiliatesGrp': 1, 
    'PayrollTaxes': 1, 
    'PayrollTaxesGrp': 1, 
    'PensionPlanContributions': 1, 
    'PensionPlanContributionsGrp': 1, 
    'PermanentlyRestrictedNetAssets': 1, 
    'PermanentlyRstrNetAssetsGrp': 1, 
    'PledgesAndGrantsReceivable': 1, 
    'PledgesAndGrantsReceivableGrp': 1, 
    'PoliciesReferenceChapters': 1, 
    'PoliciesReferenceChaptersInd': 1, 
    'PoliticalActivities': 1, 
    'PoliticalCampaignActyInd': 1, 
    'PrepaidExpensesDeferredCharges': 1, 
    'PrepaidExpensesDefrdChargesGrp': 1, 
    'PrincipalOfficerNm': 1, 
    'PriorExcessBenefitTransaction': 1, 
    'PriorPeriodAdjustmentsAmt': 1, 
    'ProfessionalFundraising': 1, 
    'ProfessionalFundraisingInd': 1, 
    'ProgSrvcAccomActy2Grp': 1, 
    'ProgSrvcAccomActy3Grp': 1, 
    'ProgSrvcAccomActyOtherGrp': 1, 
    'ProgramServiceRevenueCY': 1, 
    'ProgramServiceRevenuePriorYear': 1, 
    'PymtTravelEntrtnmntPubOfclGrp': 1, 
    'RcvblFromDisqualifiedPrsnGrp': 1, 
    'ReceivablesFromDisqualPersons': 1, 
    'ReconcilationDonatedServices': 1, 
    'ReconcilationInvestExpenses': 1, 
    'ReconcilationPriorAdjustment': 1, 
    'ReconcilationRevenueExpenses': 1, 
    'ReconcilationRevenueExpnssAmt': 1, 
    'ReconciliationUnrealizedInvest': 1, 
    'RegularMonitoringEnforcement': 1, 
    'RegularMonitoringEnfrcInd': 1, 
    'RelatedEntity': 1, 
    'RelatedEntityInd': 1, 
    'RelatedOrgControlledEntity': 1, 
    'RelatedOrganizationCtrlEntInd': 1, 
    'RelatedOrganizations': 1, 
    'RelatedOrganizationsAmt': 1, 
    'RetainedEarningsEndowmentEtc': 1, 
    'ReturnTs': 1, 
    'Revenue': 1, 
    'RevenueAmt': 1, 
    'RevenuesLessExpensesCY': 1, 
    'RevenuesLessExpensesPriorYear': 1, 
    'Royalties': 1, 
    'RoyaltiesGrp': 1, 
    'RtnEarnEndowmentIncmOthFndsGrp': 1, 
    'SalariesEtcCurrentYear': 1, 
    'SalariesEtcPriorYear': 1, 
    'SavingsAndTempCashInvestments': 1, 
    'SavingsAndTempCashInvstGrp': 1, 
    'SignificantChange': 1, 
    'SignificantChangeInd': 1, 
    'SignificantNewProgramServices': 1, 
    'SignificantNewProgramSrvcInd': 1, 
    'SpecialConditionDesc': 1, 
    'SpecialConditionDescription': 1, 
    'StateLegalDomicile': 1, 
    'StatesWhereCopyOfReturnIsFiled': 1, 
    'StatesWhereCopyOfReturnIsFldCd': 1, 
    'TaxExemptBondLiabilities': 1, 
    'TaxExemptBondLiabilitiesGrp': 1, 
    'TaxPeriod': 1, 
    'TaxPeriodBeginDate': 1, 
    'TaxPeriodBeginDt': 1, 
    'TaxPeriodEndDate': 1, 
    'TaxPeriodEndDt': 1, 
    'TaxYear': 1, 
    'TaxYr': 1, 
    'TemporarilyRestrictedNetAssets': 1, 
    'TemporarilyRstrNetAssetsGrp': 1, 
    'TerminatedReturn': 1, 
    'TerminationOrContraction': 1, 
    'Timestamp': 1, 
    'TotReportableCompRltdOrgAmt': 1, 
    'TotalAssets': 1, 
    'TotalAssetsBOY': 1, 
    'TotalAssetsBOYAmt': 1, 
    'TotalAssetsEOY': 1, 
    'TotalAssetsEOYAmt': 1, 
    'TotalAssetsGrp': 1, 
    'TotalCompGT150K': 1, 
    'TotalCompGreaterThan150KInd': 1, 
    'TotalContributions': 1, 
    'TotalContributionsAmt': 1, 
    'TotalEmployeeCnt': 1, 
    'TotalExpensesCurrentYear': 1, 
    'TotalExpensesPriorYear': 1, 
    'TotalFunctionalExpenses': 1, 
    'TotalFunctionalExpensesGrp': 1, 
    'TotalFundrsngExpCurrentYear': 1, 
    'TotalGrossUBI': 1, 
    'TotalGrossUBIAmt': 1, 
    'TotalJointCosts': 1, 
    'TotalJointCostsGrp': 1, 
    'TotalLiabilitiesBOY': 1, 
    'TotalLiabilitiesBOYAmt': 1, 
    'TotalLiabilitiesEOY': 1, 
    'TotalLiabilitiesEOYAmt': 1, 
    'TotalNbrEmployees': 1, 
    'TotalNbrVolunteers': 1, 
    'TotalOfOtherProgramServiceExp': 1, 
    'TotalOfOtherProgramServiceGrnt': 1, 
    'TotalOfOtherProgramServiceRev': 1, 
    'TotalOtherCompensation': 1, 
    'TotalOtherCompensationAmt': 1, 
    'TotalOtherProgSrvcExpenseAmt': 1, 
    'TotalOtherProgSrvcGrantAmt': 1, 
    'TotalOtherProgSrvcRevenueAmt': 1, 
    'TotalOtherRevenue': 1, 
    'TotalProfFundrsngExpCY': 1, 
    'TotalProfFundrsngExpPriorYear': 1, 
    'TotalProgramServiceExpense': 1, 
    'TotalProgramServiceExpensesAmt': 1, 
    'TotalProgramServiceRevenue': 1, 
    'TotalProgramServiceRevenueAmt': 1, 
    'TotalReportableCompFrmRltdOrgs': 1, 
    'TotalReportableCompFromOrg': 1, 
    'TotalReportableCompFromOrgAmt': 1, 
    'TotalRevenue': 1, 
    'TotalRevenueCurrentYear': 1, 
    'TotalRevenueGrp': 1, 
    'TotalRevenuePriorYear': 1, 
    'TotalVolunteersCnt': 1, 
    'TransactionRelatedEntity': 1, 
    'TransactionWithControlEntInd': 1, 
    'TransfersToExemptNonChrtblOrg': 1, 
    'Travel': 1, 
    'TravelEntrtnmntPublicOfficials': 1, 
    'TravelGrp': 1, 
    'TrnsfrExmptNonChrtblRltdOrgInd': 1, 
    'TypeOfOrgOtherDescription': 1, 
    'TypeOfOrganizationAssocInd': 1, 
    'TypeOfOrganizationAssociation': 1, 
    'TypeOfOrganizationCorpInd': 1, 
    'TypeOfOrganizationCorporation': 1, 
    'TypeOfOrganizationOther': 1, 
    'TypeOfOrganizationOtherInd': 1, 
    'TypeOfOrganizationTrust': 1, 
    'TypeOfOrganizationTrustInd': 1, 
    'UnrelatedBusIncmOverLimitInd': 1, 
    'UnrelatedBusinessIncome': 1, 
    'UnrestrictedNetAssets': 1, 
    'UnrestrictedNetAssetsGrp': 1, 
    'UnsecuredNotesLoansPayable': 1, 
    'UnsecuredNotesLoansPayableGrp': 1, 
    'UponRequest': 1, 
    'UponRequestInd': 1, 
    'VotingMembersGoverningBodyCnt': 1, 
    'VotingMembersIndependentCnt': 1, 
    'WebSite': 1, 
    'WebsiteAddressTxt': 1, 
    'WhistleblowerPolicy': 1, 
    'WhistleblowerPolicyInd': 1, 
    'WrittenPolicyOrProcedure': 1, 
    'WrittenPolicyOrProcedureInd': 1, 
    'YearFormation': 1})
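
Copying and pasting the helper output works, but the same projection can be built programmatically from `mongo_cols`. The sketch below uses a three-name stand-in for the full list; `id_cols` and `projection` are illustrative names, and the final `find` call is left as a comment because it needs a running MongoDB instance.

```python
# Short stand-in for the 480-name list built from the concordance file
mongo_cols = ["AccountsReceivable", "AddressChange", "YearFormation"]

# Identifier columns requested alongside the verified variables
id_cols = ["EIN", "OrganizationName", "DLN", "URL", "ReturnHeader"]

# 0 excludes MongoDB's automatic _id field; 1 includes a field
projection = {"_id": 0}
projection.update({name: 1 for name in id_cols + sorted(mongo_cols)})

print(projection)

# With a live connection, the query would then be:
# cursor = filings_990.find({}, projection)
```

Building the dictionary this way keeps the query in step with the concordance file if variables are added or removed later.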

<br>In the next block we define a generator function, ``batched``, that lets us read the MongoDB data in batches of a fixed size rather than one document at a time. This is only necessary for very large datasets. 

In [34]:
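def batched(cursor, batch_size):
    # Yield lists of up to batch_size documents from the cursor
    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield any documents left over after the last full batch
    if batch:
        yield batch

<br>In pandas 1.4, the ``DataFrame.append`` call used in the next section emits the message "FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead." We suppress that warning here.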

In [33]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
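
Suppressing the warning keeps the output clean, but `DataFrame.append` was removed in pandas 2.0, so the loop in the next section will not run on newer versions. One alternative, sketched here under the assumption that all batches fit in memory, is to build one frame per batch and join them with a single `pd.concat` call. The list of small dictionaries is a made-up stand-in for the MongoDB cursor, `frames` and `df_demo` are illustrative names, and `batched` is repeated so the example runs on its own.

```python
import pandas as pd


def batched(cursor, batch_size):
    """Yield lists of up to batch_size documents from an iterable."""
    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Made-up documents standing in for the MongoDB cursor
fake_cursor = iter([{"EIN": str(i), "TotalAssets": i * 10} for i in range(5)])

# One DataFrame per batch, concatenated once at the end
frames = [pd.DataFrame(batch) for batch in batched(fake_cursor, 2)]
df_demo = pd.concat(frames, ignore_index=True)
print(df_demo.shape)
```

Concatenating once also avoids re-copying the growing frame on every iteration, which is what repeated appends do.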

### Read 990 DB into PANDAS DF
Read the verified variables for all filings into a PANDAS dataframe. Depending on your machine, this can take many hours; the run shown here took more than 11 hours.

In [35]:
%%time
df = pd.DataFrame()
for batch in batched(cursor, 1000):
    df = df.append(batch, ignore_index=True)    
df[:1]

Wall time: 11h 17min 44s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalGrossUBI,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersCY,SalariesEtcCurrentYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,MissionDescription,SignificantNewProgramServices,SignificantChange,Expense,Grants,Revenue,Description,TotalProgramServiceExpense,PoliticalActivities,LobbyingActivities,ProfessionalFundraising,FundraisingActivities,Gaming,ExcessBenefitTransaction,PriorExcessBenefitTransaction,DisregardedEntity,RelatedEntity,RelatedOrgControlledEntity,TransactionRelatedEntity,TransfersToExemptNonChrtblOrg,ActivitiesConductedPartnership,NumberFormsTransmittedWith1096,NumberOfEmployees,UnrelatedBusinessIncome,InfoInScheduleOPartVI,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,FormersListed,TotalCompGT150K,CompensationFromOtherSources,MembershipDues,AllOtherContributions,TotalContributions,TotalProgramServiceRevenue,TotalRevenue,GrantsToDomesticOrgs,FeesForServicesAccounting,FeesForServicesOther,OfficeExpenses,DepreciationDepletion,OtherExpenses,AllOtherExpenses,TotalFunctionalExpenses,CashNonInterestBearing,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasisNet,TotalAssets,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,ReconcilationRevenueExpenses,MethodOfAccountingCash,AccountantCompileOrReview,FSAudited,Organization501c,YearFormation,StateLegalDomicile,TotalNbrVolunteers,SalariesEtcPriorYear,TotalLiabilitiesBOY,Activity2,Activity3,ActivityOther,TotalOfOtherProgramServiceExp,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,OtherWebsite,TotalReportableCompFromOrg,TotalOtherCompensation,CompCurrentOfficersDirectors,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesLegal,FeesForServicesProfFundraising,Advertising,InformationTechnology,Occupancy,Travel,ConferencesMeetings,Interest,Insurance,PledgesAndGrantsReceivable,AccountsReceivable,PrepaidExpensesDeferredCharges,AccountsPayableAccruedExpenses,LoansFromOfficersDirectors,MortNotesPyblSecuredInvestProp,OtherLiabilities,FollowSFAS117,UnrestrictedNetAssets,TemporarilyRestrictedNetAssets,PermanentlyRestrictedNetAssets,MethodOfAccountingAccrual,AuditCommittee,FederalGrantAuditRequired,FederalGrantAuditPerformed,TypeOfOrganizationTrust,NetUnrelatedBusinessTxblIncome,BenefitsPaidToMembersPriorYear,TotalReportableCompFrmRltdOrgs,NumberIndividualsGT100K,NumberOfContractorsGT100K,TotalOtherRevenue,BenefitsToMembers,FeesForServicesManagement,FeesForServicesInvstMgmntFees,InvestmentsPubTradedSecurities,OtherAssetsTotal,DeferredRevenue,SavingsAndTempCashInvestments,InventoriesForSaleOrUse,AllAffiliatesIncluded,PoliciesReferenceChapters,WrittenPolicyOrProcedure,GovernmentGrants,GrantsToDomesticIndividuals,ForeignGrants,CompDisqualPersons,FeesForServicesLobbying,Royalties,TravelEntrtnmntPublicOfficials,PaymentsToAffiliates,ReceivablesFromDisqualPersons,OtherNotesLoansReceivableNet,InvestmentsOtherSecurities,InvestmentsProgramRelated,IntangibleAssets,AmendedReturn,TotalProfFundrsngExpPriorYear,InfoInScheduleOPartIII,InfoInScheduleOPartXII,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,TypeOfOrganizationOther,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,InfoInScheduleOPartXI,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,InitialReturn,TotalOfOtherProgramServiceRev,GroupExemptionNumber,UnsecuredNotesLoansPayable,NoncashContributions,TotalOfOtherProgramServiceGrnt,GrantsPayable,TaxExemptBondLiabilities,EscrowAccountLiability,OwnWebsite,AddressChange,GrossSalesOfInventory,CostOfGoodsSold,GrossIncomeGaming,GamingDirectExpenses,MethodOfAccountingOther,TotalJointCosts,RelatedOrganizations,InfoInScheduleOPartVII,FederatedCampaigns,InfoInScheduleOPartV,TerminatedReturn,TerminationOrContraction,ActivityCode,Organization4947a1,SpecialConditionDescription,CountryLegalDomicile,InfoInScheduleOPartIX,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,InfoInScheduleOPartVIII,InfoInScheduleOPartX,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFu
ndBalancesEOYAmt,InfoInScheduleOPartIIIInd,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,TrnsfrExmptNonChrtblRltdOrgInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,InformationTechnologyGrp,ConferencesMeetingsGrp,InsuranceGrp,OtherExpensesGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,P
YRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,ExpenseAmt,GrantAmt,RevenueAmt,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcGrantAmt,TotalProgramServiceExpensesAmt,InfoInScheduleOPartVIInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,GrantsToDomesticIndividualsGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,AdvertisingGrp,TravelGrp,InterestGrp,DepreciationDepletionGrp,SavingsAndTempCashInvstGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,TemporarilyRstrNetAssetsGrp,InfoInScheduleOPartXIInd,NetUnrlzdGainsLossesInvstAmt,InfoInScheduleOPartXIIInd,AuditCommitteeInd,AllAffiliatesIncludedInd,GrantsToDomesticOrgsGrp,ForeignGrantsGrp,BenefitsToMembersGrp,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,RoyaltiesGrp,OccupancyGrp,PymtTravelEntrtnmntPubOfclGrp,PaymentsToAffiliatesGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsPubTradedSecGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,TotalOtherProgSrvcExpenseAmt,InfoInScheduleOPartVInd,MethodOfAccountingAccrualInd,NoncashContributionsAmt,GrantsPayableGrp,PermanentlyRstrNetAssetsGrp,Tax
ExemptBondLiabilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,TotalOtherProgSrvcRevenueAmt,OwnWebsiteInd,TotalJointCostsGrp,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,InfoInScheduleOPartIXInd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,InfoInScheduleOPartXInd,GroupExemptionNum,InfoInScheduleOPartVIIInd,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,InfoInScheduleOPartVIIIInd,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd
0,NEW ALBANY WALKING CLUB INC,https://s3.amazonaws.com/irs-form-990/201121029349300402_public.xml,93493102004021,201012,203840246,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-04-12T08:58:08-05:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'WHALEN & COMPANY CPAS'}, 'PreparerFirmUSAddress': {'AddressLine1': '25...",PHILIP HEIT,299757,False,X,NEWALBANYWALKINGCLUB.COM,X,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,7,7,0,0,105000,172910,134744,126722,131,125,2094,0,241969,299757,110600,114700,0,0,0,6717,104478,122499,215078,237199,26891,62558,105105,167663,0,105105,167663,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,False,False,228055,114700,126847,"TO PROMOTE HEALTH AND PREVENT DISEASE, ENCOURAGE A SUPPORTIVE ENVIRONMENT, AND ELEVATE THE STATUS OF WALKING AND ITS BENEFITS TO INDIVIDUALS AND THE CENTRAL OHIO COMMUNITY.",228055,False,False,False,False,False,False,False,False,False,False,False,False,False,0,0,False,X,7,7,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,OH,X,X,False,False,False,910,172000,172910,126722,"{'TotalRevenueColumn': '299757', 'RelatedOrExemptFunctionIncome': '126847'}","{'Total': '114700', 'ProgramServices': '114700'}","{'Total': '1000', 'ManagementAndGeneral': '1000'}","{'Total': '75', 'ManagementAndGeneral': '75'}","{'Total': '1352', 'ManagementAndGeneral': '1352'}","{'Total': '1213', 'ProgramServices': '1213'}","[{'Description': 'RACE CLOTHING', 'Total': '37423', 'ProgramServices': '37423'}, {'Description': 'RACE MANAGEMENT FEE', 'Total': '30069', 'ProgramServices': '30069'}, {'Description': 'RACE EXPENSES', 'Total': '17372', 'ProgramServices': '17372'},...","{'Total': '19955', 'ProgramServices': '19955'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'BOY': '104568', 'EOY': '167341'}",2883,2561,"{'BOY': '537', 'EOY': 
'322'}","{'BOY': '105105', 'EOY': '167663'}",X,"{'BOY': '105105', 'EOY': '167663'}",62558,X,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [36]:
print(len(df))

2104435


#### Save DF
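We will save the dataset as a gzip-compressed pickle file, the serialization format that PANDAS can read back with ``read_pickle``. With more than 2.1 million rows this is a very large file, so saving takes some time (about 49 minutes in the run below). 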

In [37]:
%%time
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
# Write the dataframe to disk as a gzip-compressed pickle
df.to_pickle('all filings August 2022 - all control variables.pkl.gz', compression='gzip')

Current date and time :  2022-08-03 12:04:11 

Wall time: 48min 45s
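
As a quick sanity check, the save step can be tried on a small stand-in dataframe before committing to the full 49-minute write. This is only a sketch: the toy data and the temporary file name below are assumptions for illustration, not the real dataset or its file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the full dataframe (all values are invented)
toy = pd.DataFrame({
    "EIN": ["000000001", "000000002"],
    "TaxPeriod": ["201012", "201112"],
    "ReturnHeader": [{"TaxYear": "2010"}, {"TaxYear": "2011"}],
})

# Write to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_filings.pkl.gz")
    toy.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")

# The restored frame has the same shape, and the nested dicts are still dicts
print(restored.shape)
print(type(restored.loc[0, "ReturnHeader"]))
```

Pickle keeps nested Python objects such as the *ReturnHeader* dictionaries as they are, whereas a flat text format like CSV would turn them into strings.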


### Process *ReturnHeader* column
The ``ReturnHeader`` column contains some key pieces of information on the organization and its 990 filing. In the XML and JSON versions of the file these data are all 'nested' under the *ReturnHeader*. We thus need to 'flatten' these data so that each variable has its own column. For this task we apply the ``json_normalize`` function in PANDAS. The code below (re-)creates our dataframe ``df`` by joining two pieces side by side: ``df`` without the *ReturnHeader* column, and the flattened *ReturnHeader* columns. Because we set ``max_level=0``, only the top level of the nesting is expanded, so deeper structures such as *Filer* and *Officer* remain nested within their own columns. The new ``df`` has the same number of rows but more columns: instead of one *ReturnHeader* column we have multiple new columns. 

In [38]:
%%time
df = pd.concat([df.drop(['ReturnHeader'], axis=1), pd.json_normalize(df['ReturnHeader'], max_level=0)], axis=1)
print(len(df))
df[:1]

2104435
Wall time: 7min 7s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalGrossUBI,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersCY,SalariesEtcCurrentYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,MissionDescription,SignificantNewProgramServices,SignificantChange,Expense,Grants,Revenue,Description,TotalProgramServiceExpense,PoliticalActivities,LobbyingActivities,ProfessionalFundraising,FundraisingActivities,Gaming,ExcessBenefitTransaction,PriorExcessBenefitTransaction,DisregardedEntity,RelatedEntity,RelatedOrgControlledEntity,TransactionRelatedEntity,TransfersToExemptNonChrtblOrg,ActivitiesConductedPartnership,NumberFormsTransmittedWith1096,NumberOfEmployees,UnrelatedBusinessIncome,InfoInScheduleOPartVI,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,U
ponRequest,NoListedPersonsCompensated,FormersListed,TotalCompGT150K,CompensationFromOtherSources,MembershipDues,AllOtherContributions,TotalContributions,TotalProgramServiceRevenue,TotalRevenue,GrantsToDomesticOrgs,FeesForServicesAccounting,FeesForServicesOther,OfficeExpenses,DepreciationDepletion,OtherExpenses,AllOtherExpenses,TotalFunctionalExpenses,CashNonInterestBearing,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasisNet,TotalAssets,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,ReconcilationRevenueExpenses,MethodOfAccountingCash,AccountantCompileOrReview,FSAudited,Organization501c,YearFormation,StateLegalDomicile,TotalNbrVolunteers,SalariesEtcPriorYear,TotalLiabilitiesBOY,Activity2,Activity3,ActivityOther,TotalOfOtherProgramServiceExp,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,OtherWebsite,TotalReportableCompFromOrg,TotalOtherCompensation,CompCurrentOfficersDirectors,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesLegal,FeesForServicesProfFundraising,Advertising,InformationTechnology,Occupancy,Travel,ConferencesMeetings,Interest,Insurance,PledgesAndGrantsReceivable,AccountsReceivable,PrepaidExpensesDeferredCharges,AccountsPayableAccruedExpenses,LoansFromOfficersDirectors,MortNotesPyblSecuredInvestProp,OtherLiabilities,FollowSFAS117,UnrestrictedNetAssets,TemporarilyRestrictedNetAssets,PermanentlyRestrictedNetAssets,MethodOfAccountingAccrual,AuditCommittee,FederalGrantAuditRequired,FederalGrantAuditPerformed,TypeOfOrganizationTrust,NetUnrelatedBusinessTxblIncome,BenefitsPaidToMembersPriorYear,TotalReportableCompFrmRltdOrgs,NumberIndividualsGT100K,NumberOfContractorsGT100K,TotalOtherRevenue,BenefitsToMembers,FeesForServicesManagement,FeesForServicesInvstMgmntFees,InvestmentsPubTradedSecurities,OtherAssetsTotal,DeferredRevenue,SavingsAndTempCashInvestments,InventoriesForSaleOrUse,AllAffiliatesIncluded,PoliciesReferenceChapters,WrittenPolicyOrProcedure,GovernmentGra
nts,GrantsToDomesticIndividuals,ForeignGrants,CompDisqualPersons,FeesForServicesLobbying,Royalties,TravelEntrtnmntPublicOfficials,PaymentsToAffiliates,ReceivablesFromDisqualPersons,OtherNotesLoansReceivableNet,InvestmentsOtherSecurities,InvestmentsProgramRelated,IntangibleAssets,AmendedReturn,TotalProfFundrsngExpPriorYear,InfoInScheduleOPartIII,InfoInScheduleOPartXII,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,TypeOfOrganizationOther,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,InfoInScheduleOPartXI,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,InitialReturn,TotalOfOtherProgramServiceRev,GroupExemptionNumber,UnsecuredNotesLoansPayable,NoncashContributions,TotalOfOtherProgramServiceGrnt,GrantsPayable,TaxExemptBondLiabilities,EscrowAccountLiability,OwnWebsite,AddressChange,GrossSalesOfInventory,CostOfGoodsSold,GrossIncomeGaming,GamingDirectExpenses,MethodOfAccountingOther,TotalJointCosts,RelatedOrganizations,InfoInScheduleOPartVII,FederatedCampaigns,InfoInScheduleOPartV,TerminatedReturn,TerminationOrContraction,ActivityCode,Organization4947a1,SpecialConditionDescription,CountryLegalDomicile,InfoInScheduleOPartIX,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,InfoInScheduleOPartVIII,InfoInScheduleOPartX,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOY
Amt,InfoInScheduleOPartIIIInd,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,TrnsfrExmptNonChrtblRltdOrgInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,InformationTechnologyGrp,ConferencesMeetingsGrp,InsuranceGrp,OtherExpensesGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLess
ExpensesAmt,TotalLiabilitiesBOYAmt,ExpenseAmt,GrantAmt,RevenueAmt,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcGrantAmt,TotalProgramServiceExpensesAmt,InfoInScheduleOPartVIInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,GrantsToDomesticIndividualsGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,AdvertisingGrp,TravelGrp,InterestGrp,DepreciationDepletionGrp,SavingsAndTempCashInvstGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,TemporarilyRstrNetAssetsGrp,InfoInScheduleOPartXIInd,NetUnrlzdGainsLossesInvstAmt,InfoInScheduleOPartXIIInd,AuditCommitteeInd,AllAffiliatesIncludedInd,GrantsToDomesticOrgsGrp,ForeignGrantsGrp,BenefitsToMembersGrp,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,RoyaltiesGrp,OccupancyGrp,PymtTravelEntrtnmntPubOfclGrp,PaymentsToAffiliatesGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsPubTradedSecGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,TotalOtherProgSrvcExpenseAmt,InfoInScheduleOPartVInd,MethodOfAccountingAccrualInd,NoncashContributionsAmt,GrantsPayableGrp,PermanentlyRstrNetAssetsGrp,TaxExemptBondLia
bilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,TotalOtherProgSrvcRevenueAmt,OwnWebsiteInd,TotalJointCostsGrp,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,InfoInScheduleOPartIXInd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,InfoInScheduleOPartXInd,GroupExemptionNum,InfoInScheduleOPartVIIInd,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,InfoInScheduleOPartVIIIInd,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd,@binaryAttachmentCount,Timestamp,TaxPeriodEndDate,PreparerFirm,ReturnType,TaxPeriodBeginDate,Filer,Officer,Preparer,TaxYear,BuildTS,DisasterRelief,@binaryAttachmentCnt,ReturnTs,TaxPeriodEndDt,PreparerFirmGrp,ReturnTypeCd,TaxPeriodBeginDt,BusinessOfficerGrp,PreparerPersonGrp,TaxYr,DisasterReliefTxt,FilingSecurityInformation,SigningOfficerGrp,AdditionalFilerInformation
0,NEW ALBANY WALKING CLUB INC,https://s3.amazonaws.com/irs-form-990/201121029349300402_public.xml,93493102004021,201012,203840246,PHILIP HEIT,299757,False,X,NEWALBANYWALKINGCLUB.COM,X,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,7,7,0,0,105000,172910,134744,126722,131,125,2094,0,241969,299757,110600,114700,0,0,0,6717,104478,122499,215078,237199,26891,62558,105105,167663,0,105105,167663,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,False,False,228055,114700,126847,"TO PROMOTE HEALTH AND PREVENT DISEASE, ENCOURAGE A SUPPORTIVE ENVIRONMENT, AND ELEVATE THE STATUS OF WALKING AND ITS BENEFITS TO INDIVIDUALS AND THE CENTRAL OHIO COMMUNITY.",228055,False,False,False,False,False,False,False,False,False,False,False,False,False,0,0,False,X,7,7,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,OH,X,X,False,False,False,910,172000,172910,126722,"{'TotalRevenueColumn': '299757', 'RelatedOrExemptFunctionIncome': '126847'}","{'Total': '114700', 'ProgramServices': '114700'}","{'Total': '1000', 'ManagementAndGeneral': '1000'}","{'Total': '75', 'ManagementAndGeneral': '75'}","{'Total': '1352', 'ManagementAndGeneral': '1352'}","{'Total': '1213', 'ProgramServices': '1213'}","[{'Description': 'RACE CLOTHING', 'Total': '37423', 'ProgramServices': '37423'}, {'Description': 'RACE MANAGEMENT FEE', 'Total': '30069', 'ProgramServices': '30069'}, {'Description': 'RACE EXPENSES', 'Total': '17372', 'ProgramServices': '17372'},...","{'Total': '19955', 'ProgramServices': '19955'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'BOY': '104568', 'EOY': '167341'}",2883,2561,"{'BOY': '537', 'EOY': '322'}","{'BOY': '105105', 'EOY': '167663'}",X,"{'BOY': '105105', 'EOY': 
'167663'}",62558,X,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,2011-04-12T08:58:08-05:00,2010-12-31,"{'PreparerFirmBusinessName': {'BusinessNameLine1': 'WHALEN & COMPANY CPAS'}, 'PreparerFirmUSAddress': {'AddressLine1': '250 WEST OLD WILSON BRIDGE ROAD STE', 'City': 'WORTHINGTON', 'State': 'OH', 'ZIPCode': '43085'}}",990,2010-01-01,"{'EIN': '203840246', 'Name': {'BusinessNameLine1': 'NEW ALBANY WALKING CLUB INC'}, 'NameControl': 'NEWA', 'USAddress': {'AddressLine1': '4000 BAUGHMAN GRANT', 'City': 'NEW ALBANY', 'State': 'OH', 'ZIPCode': '43054'}}","{'Name': 'PHILIP HEIT', 'Title': 'PRESIDENT', 'DateSigned': '2011-03-22', 'AuthorizeThirdParty': 'true'}","{'Name': 'L WOJCIECHOWSKI CPA', 'Phone': '6143964200', 'DatePrepared': '2011-03-28'}",2010,2016-02-24 21:20:13Z,,,,,,,,,,,,,,
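
The flattening step can be sketched on a two-row stand-in to see what ``max_level=0`` does. The field names below mimic the *ReturnHeader* keys, but the values are invented, and the explicit ``reset_index`` and ``tolist`` calls are my own precautions for this sketch rather than part of the cell above.

```python
import pandas as pd

# Hypothetical two-row stand-in; key names mimic the ReturnHeader, values are invented
toy = pd.DataFrame({
    "EIN": ["000000001", "000000002"],
    "ReturnHeader": [
        {"TaxYear": "2010", "Filer": {"EIN": "000000001", "NameControl": "AAAA"}},
        {"TaxYr": "2015", "Filer": {"EIN": "000000002", "NameControl": "BBBB"}},
    ],
})

# Use a 0..n-1 index so the rows line up with the flattened frame in concat
toy = toy.reset_index(drop=True)

# max_level=0 expands only the top-level keys; nested dicts stay as dicts
header = pd.json_normalize(toy["ReturnHeader"].tolist(), max_level=0)

# Join the original columns (minus ReturnHeader) with the flattened ones
flat = pd.concat([toy.drop(columns=["ReturnHeader"]), header], axis=1)

print(flat.columns.tolist())
print(len(flat) == len(toy))
print(type(flat.loc[0, "Filer"]))
```

A key that is missing from a given filing's header (here ``TaxYr`` in the first row) simply shows up as a missing value in that row, which is why the real output above has so many empty cells.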


<br>Flattening *ReturnHeader* gives us 25 additional columns, as seen in the following block.

In [52]:
print('# of columns in df:', len(df.columns), '\n')
df.columns[-25:]

# of columns in df: 497 



Index(['@binaryAttachmentCount', 'Timestamp', 'TaxPeriodEndDate',
       'PreparerFirm', 'ReturnType', 'TaxPeriodBeginDate', 'Filer', 'Officer',
       'Preparer', 'TaxYear', 'BuildTS', 'DisasterRelief',
       '@binaryAttachmentCnt', 'ReturnTs', 'TaxPeriodEndDt', 'PreparerFirmGrp',
       'ReturnTypeCd', 'TaxPeriodBeginDt', 'BusinessOfficerGrp',
       'PreparerPersonGrp', 'TaxYr', 'DisasterReliefTxt',
       'FilingSecurityInformation', 'SigningOfficerGrp',
       'AdditionalFilerInformation'],
      dtype='object')

<br>Not all of these columns contain information that is useful for us, so we will delete the unneeded *ReturnHeader* columns.

In [53]:
%%time
# List the columns in df that are not in mongo_cols
print([c for c in df.columns.tolist() if c not in mongo_cols])
# ReturnHeader columns that we do not need
omit_cols = ['@binaryAttachmentCount', 'PreparerFirm', 'ReturnType',
             'Preparer', 'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd',
             'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation']
print(len(df.columns))
# Keep every column except those listed in omit_cols
df = df[[c for c in df.columns.tolist() if c not in omit_cols]]
print(len(df))
print(len(df.columns))
df[:1]

['OrganizationName', 'URL', 'DLN', 'EIN', '@binaryAttachmentCount', 'PreparerFirm', 'ReturnType', 'Preparer', 'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd', 'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation', 'SigningOfficerGrp', 'AdditionalFilerInformation']
497
2104435
486
Wall time: 1min 58s
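
For comparison, the same column removal can be written with the built-in ``DataFrame.drop``. This is a minimal sketch on an invented one-row dataframe; the column values are made up, and the shortened ``omit_cols`` list is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in using a few of the column names from above (values invented)
toy = pd.DataFrame({
    "EIN": ["000000001"],
    "TaxYear": ["2010"],
    "PreparerFirm": [{"PreparerFirmBusinessName": "EXAMPLE FIRM"}],
    "ReturnType": ["990"],
})

# 'DisasterRelief' is deliberately absent from toy to show how each approach copes
omit_cols = ["PreparerFirm", "ReturnType", "DisasterRelief"]

# List-comprehension approach, as in the cell above
kept_a = toy[[c for c in toy.columns.tolist() if c not in omit_cols]]

# Built-in equivalent; errors="ignore" skips names that are not present
kept_b = toy.drop(columns=omit_cols, errors="ignore")

print(kept_a.columns.tolist())
print(kept_b.columns.tolist())
```

Both approaches tolerate names in ``omit_cols`` that do not exist in the dataframe; without ``errors="ignore"``, ``drop`` would raise a ``KeyError`` for the missing name instead.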


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalGrossUBI,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersCY,SalariesEtcCurrentYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,MissionDescription,SignificantNewProgramServices,SignificantChange,Expense,Grants,Revenue,Description,TotalProgramServiceExpense,PoliticalActivities,LobbyingActivities,ProfessionalFundraising,FundraisingActivities,Gaming,ExcessBenefitTransaction,PriorExcessBenefitTransaction,DisregardedEntity,RelatedEntity,RelatedOrgControlledEntity,TransactionRelatedEntity,TransfersToExemptNonChrtblOrg,ActivitiesConductedPartnership,NumberFormsTransmittedWith1096,NumberOfEmployees,UnrelatedBusinessIncome,InfoInScheduleOPartVI,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,U
ponRequest,NoListedPersonsCompensated,FormersListed,TotalCompGT150K,CompensationFromOtherSources,MembershipDues,AllOtherContributions,TotalContributions,TotalProgramServiceRevenue,TotalRevenue,GrantsToDomesticOrgs,FeesForServicesAccounting,FeesForServicesOther,OfficeExpenses,DepreciationDepletion,OtherExpenses,AllOtherExpenses,TotalFunctionalExpenses,CashNonInterestBearing,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasisNet,TotalAssets,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,ReconcilationRevenueExpenses,MethodOfAccountingCash,AccountantCompileOrReview,FSAudited,Organization501c,YearFormation,StateLegalDomicile,TotalNbrVolunteers,SalariesEtcPriorYear,TotalLiabilitiesBOY,Activity2,Activity3,ActivityOther,TotalOfOtherProgramServiceExp,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,OtherWebsite,TotalReportableCompFromOrg,TotalOtherCompensation,CompCurrentOfficersDirectors,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesLegal,FeesForServicesProfFundraising,Advertising,InformationTechnology,Occupancy,Travel,ConferencesMeetings,Interest,Insurance,PledgesAndGrantsReceivable,AccountsReceivable,PrepaidExpensesDeferredCharges,AccountsPayableAccruedExpenses,LoansFromOfficersDirectors,MortNotesPyblSecuredInvestProp,OtherLiabilities,FollowSFAS117,UnrestrictedNetAssets,TemporarilyRestrictedNetAssets,PermanentlyRestrictedNetAssets,MethodOfAccountingAccrual,AuditCommittee,FederalGrantAuditRequired,FederalGrantAuditPerformed,TypeOfOrganizationTrust,NetUnrelatedBusinessTxblIncome,BenefitsPaidToMembersPriorYear,TotalReportableCompFrmRltdOrgs,NumberIndividualsGT100K,NumberOfContractorsGT100K,TotalOtherRevenue,BenefitsToMembers,FeesForServicesManagement,FeesForServicesInvstMgmntFees,InvestmentsPubTradedSecurities,OtherAssetsTotal,DeferredRevenue,SavingsAndTempCashInvestments,InventoriesForSaleOrUse,AllAffiliatesIncluded,PoliciesReferenceChapters,WrittenPolicyOrProcedure,GovernmentGra
nts,GrantsToDomesticIndividuals,ForeignGrants,CompDisqualPersons,FeesForServicesLobbying,Royalties,TravelEntrtnmntPublicOfficials,PaymentsToAffiliates,ReceivablesFromDisqualPersons,OtherNotesLoansReceivableNet,InvestmentsOtherSecurities,InvestmentsProgramRelated,IntangibleAssets,AmendedReturn,TotalProfFundrsngExpPriorYear,InfoInScheduleOPartIII,InfoInScheduleOPartXII,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,TypeOfOrganizationOther,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,InfoInScheduleOPartXI,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,InitialReturn,TotalOfOtherProgramServiceRev,GroupExemptionNumber,UnsecuredNotesLoansPayable,NoncashContributions,TotalOfOtherProgramServiceGrnt,GrantsPayable,TaxExemptBondLiabilities,EscrowAccountLiability,OwnWebsite,AddressChange,GrossSalesOfInventory,CostOfGoodsSold,GrossIncomeGaming,GamingDirectExpenses,MethodOfAccountingOther,TotalJointCosts,RelatedOrganizations,InfoInScheduleOPartVII,FederatedCampaigns,InfoInScheduleOPartV,TerminatedReturn,TerminationOrContraction,ActivityCode,Organization4947a1,SpecialConditionDescription,CountryLegalDomicile,InfoInScheduleOPartIX,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,InfoInScheduleOPartVIII,InfoInScheduleOPartX,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOY
Amt,InfoInScheduleOPartIIIInd,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,TrnsfrExmptNonChrtblRltdOrgInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,InformationTechnologyGrp,ConferencesMeetingsGrp,InsuranceGrp,OtherExpensesGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLess
ExpensesAmt,TotalLiabilitiesBOYAmt,ExpenseAmt,GrantAmt,RevenueAmt,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcGrantAmt,TotalProgramServiceExpensesAmt,InfoInScheduleOPartVIInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,GrantsToDomesticIndividualsGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,AdvertisingGrp,TravelGrp,InterestGrp,DepreciationDepletionGrp,SavingsAndTempCashInvstGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,TemporarilyRstrNetAssetsGrp,InfoInScheduleOPartXIInd,NetUnrlzdGainsLossesInvstAmt,InfoInScheduleOPartXIIInd,AuditCommitteeInd,AllAffiliatesIncludedInd,GrantsToDomesticOrgsGrp,ForeignGrantsGrp,BenefitsToMembersGrp,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,RoyaltiesGrp,OccupancyGrp,PymtTravelEntrtnmntPubOfclGrp,PaymentsToAffiliatesGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsPubTradedSecGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,TotalOtherProgSrvcExpenseAmt,InfoInScheduleOPartVInd,MethodOfAccountingAccrualInd,NoncashContributionsAmt,GrantsPayableGrp,PermanentlyRstrNetAssetsGrp,TaxExemptBondLia
bilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,TotalOtherProgSrvcRevenueAmt,OwnWebsiteInd,TotalJointCostsGrp,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,InfoInScheduleOPartIXInd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,InfoInScheduleOPartXInd,GroupExemptionNum,InfoInScheduleOPartVIIInd,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,InfoInScheduleOPartVIIIInd,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd,Timestamp,TaxPeriodEndDate,TaxPeriodBeginDate,Filer,Officer,TaxYear,BuildTS,ReturnTs,TaxPeriodEndDt,TaxPeriodBeginDt,BusinessOfficerGrp,TaxYr,SigningOfficerGrp,AdditionalFilerInformation
0,NEW ALBANY WALKING CLUB INC,https://s3.amazonaws.com/irs-form-990/201121029349300402_public.xml,93493102004021,201012,203840246,PHILIP HEIT,299757,False,X,NEWALBANYWALKINGCLUB.COM,X,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,7,7,0,0,105000,172910,134744,126722,131,125,2094,0,241969,299757,110600,114700,0,0,0,6717,104478,122499,215078,237199,26891,62558,105105,167663,0,105105,167663,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,False,False,228055,114700,126847,"TO PROMOTE HEALTH AND PREVENT DISEASE, ENCOURAGE A SUPPORTIVE ENVIRONMENT, AND ELEVATE THE STATUS OF WALKING AND ITS BENEFITS TO INDIVIDUALS AND THE CENTRAL OHIO COMMUNITY.",228055,False,False,False,False,False,False,False,False,False,False,False,False,False,0,0,False,X,7,7,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,OH,X,X,False,False,False,910,172000,172910,126722,"{'TotalRevenueColumn': '299757', 'RelatedOrExemptFunctionIncome': '126847'}","{'Total': '114700', 'ProgramServices': '114700'}","{'Total': '1000', 'ManagementAndGeneral': '1000'}","{'Total': '75', 'ManagementAndGeneral': '75'}","{'Total': '1352', 'ManagementAndGeneral': '1352'}","{'Total': '1213', 'ProgramServices': '1213'}","[{'Description': 'RACE CLOTHING', 'Total': '37423', 'ProgramServices': '37423'}, {'Description': 'RACE MANAGEMENT FEE', 'Total': '30069', 'ProgramServices': '30069'}, {'Description': 'RACE EXPENSES', 'Total': '17372', 'ProgramServices': '17372'},...","{'Total': '19955', 'ProgramServices': '19955'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'BOY': '104568', 'EOY': '167341'}",2883,2561,"{'BOY': '537', 'EOY': '322'}","{'BOY': '105105', 'EOY': '167663'}",X,"{'BOY': '105105', 'EOY': 
'167663'}",62558,X,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011-04-12T08:58:08-05:00,2010-12-31,2010-01-01,"{'EIN': '203840246', 'Name': {'BusinessNameLine1': 'NEW ALBANY WALKING CLUB INC'}, 'NameControl': 'NEWA', 'USAddress': {'AddressLine1': '4000 BAUGHMAN GRANT', 'City': 'NEW ALBANY', 'State': 'OH', 'ZIPCode': '43054'}}","{'Name': 'PHILIP HEIT', 'Title': 'PRESIDENT', 'DateSigned': '2011-03-22', 'AuthorizeThirdParty': 'true'}",2010,2016-02-24 21:20:13Z,,,,,,,


### Create *Fiscal Year* Variable

In [58]:
[c for c in df.columns.tolist() if 'Tax' in c]

['TaxPeriod',
 'PayrollTaxes',
 'TaxExemptBondLiabilities',
 'PayrollTaxesGrp',
 'TaxExemptBondLiabilitiesGrp',
 'TaxPeriodEndDate',
 'TaxPeriodBeginDate',
 'TaxYear',
 'TaxPeriodEndDt',
 'TaxPeriodBeginDt',
 'TaxYr']

<br>Next we'll count how many observations have a value for *TaxPeriod* and how many are missing one.

In [59]:
%%time
print(len(df[df['TaxPeriod'].notnull()]))
print(len(df[df['TaxPeriod'].isnull()]))

2104435
0
Wall time: 1min 2s
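
<br>As an aside, the same two counts can be obtained without building filtered copies of the whole dataframe, which may be noticeably faster on two million rows. This is a minimal sketch on a toy dataframe; the `toy` name and its values are made up for illustration and are not part of the 990 data.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are invented for illustration)
toy = pd.DataFrame({'TaxPeriod': ['201012', '201112', None, '201912']})

# notna() / isna() return boolean Series; summing them counts the True values
n_present = int(toy['TaxPeriod'].notna().sum())
n_missing = int(toy['TaxPeriod'].isna().sum())

print(n_present)
print(n_missing)
```

Because only the single column is scanned, no intermediate dataframe with all 487 columns needs to be created.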


<br>We can show here the top 5 frequencies for this variable.

In [60]:
df['TaxPeriod'].value_counts().head()

201912    146388
201812    145700
201712    138828
201612    131311
201512    124527
Name: TaxPeriod, dtype: int64

<br>Show the data type for *TaxPeriod*. It is ``O``, short for 'object', which here means the values are stored as strings.

In [61]:
df['TaxPeriod'].dtype

dtype('O')

<br>We'll create a new variable, *fiscal_year*, that comprises the first four characters (the year portion) of the *TaxPeriod* value.

In [62]:
df['fiscal_year'] = df['TaxPeriod'].str[:4]
df['fiscal_year'].value_counts()

2019    265281
2018    261382
2017    251401
2016    240304
2015    228000
2014    210538
2013    190710
2012    170761
2020    135692
2011    126923
2010     22562
2021       878
2108         1
2001         1
2000         1
Name: fiscal_year, dtype: int64
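
<br>To make the slicing step concrete, here is a small sketch of how ``.str[:4]`` behaves on string values and on missing values. It assumes, based on the sample row shown earlier (*TaxPeriod* 201012 alongside a period end date of 2010-12-31), that *TaxPeriod* is in YYYYMM form. The `tax_period` Series and the `end_month` variable are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical TaxPeriod values in YYYYMM form, stored as strings
tax_period = pd.Series(['201012', '201906', '210812', None], name='TaxPeriod')

# .str[:4] slices each string element-wise; missing values stay missing
fiscal_year = tax_period.str[:4]

# The remaining two characters would be the month in which the period ends
end_month = tax_period.str[4:6]

print(fiscal_year.tolist())
print(end_month.tolist())
```

Note that the result is still a string column, so a value such as '2108' passes through unchanged and has to be caught by a separate check.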

<br>To get a rough sense of the breakdown of the data by year, we will create a new dataset called *years*, name the year column, sort the dataset, and then show the data. The single observations for 2000, 2001, and 2108 are almost certainly data entry errors. The rest of the values are as expected: the filings run from 2010 through 2021.

Side note: we will use a different variable in our regressions for tax year. We'll get to that in subsequent notebooks.

In [63]:
%%time
years = pd.DataFrame(df['fiscal_year'].value_counts())
years.index.name = 'year'
years = years.reset_index()
years = years.sort_values('year')
years

Wall time: 349 ms


Unnamed: 0,year,fiscal_year
14,2000,1
13,2001,1
10,2010,22562
9,2011,126923
7,2012,170761
6,2013,190710
5,2014,210538
4,2015,228000
3,2016,240304
2,2017,251401
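
<br>The implausible years mentioned above could be flagged rather than spotted by eye. The following is only a sketch under stated assumptions: the 2010 to 2021 window is taken from the range described in the text, and the `toy` dataframe and the `year_ok` column are hypothetical names introduced for this example.

```python
import pandas as pd

# Toy data mixing plausible and implausible fiscal years (stored as strings)
toy = pd.DataFrame({'fiscal_year': ['2010', '2019', '2108', '2001', '2021']})

# Convert to numbers; errors='coerce' turns unparseable strings into NaN
year_num = pd.to_numeric(toy['fiscal_year'], errors='coerce')

# Assumed plausible window (inclusive on both ends by default)
toy['year_ok'] = year_num.between(2010, 2021)

# Rows that fall outside the window and deserve a closer look
suspect = toy[~toy['year_ok']]
print(suspect)
```

Flagging rather than dropping keeps the decision about these filings open until the later notebooks.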


In [64]:
print("Number of columns:", len(df.columns))
print("Number of observations:", len(df))
df[:1]

Number of columns: 487
Number of observations: 2104435


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalGrossUBI,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersCY,SalariesEtcCurrentYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,MissionDescription,SignificantNewProgramServices,SignificantChange,Expense,Grants,Revenue,Description,TotalProgramServiceExpense,PoliticalActivities,LobbyingActivities,ProfessionalFundraising,FundraisingActivities,Gaming,ExcessBenefitTransaction,PriorExcessBenefitTransaction,DisregardedEntity,RelatedEntity,RelatedOrgControlledEntity,TransactionRelatedEntity,TransfersToExemptNonChrtblOrg,ActivitiesConductedPartnership,NumberFormsTransmittedWith1096,NumberOfEmployees,UnrelatedBusinessIncome,InfoInScheduleOPartVI,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,U
ponRequest,NoListedPersonsCompensated,FormersListed,TotalCompGT150K,CompensationFromOtherSources,MembershipDues,AllOtherContributions,TotalContributions,TotalProgramServiceRevenue,TotalRevenue,GrantsToDomesticOrgs,FeesForServicesAccounting,FeesForServicesOther,OfficeExpenses,DepreciationDepletion,OtherExpenses,AllOtherExpenses,TotalFunctionalExpenses,CashNonInterestBearing,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasisNet,TotalAssets,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,ReconcilationRevenueExpenses,MethodOfAccountingCash,AccountantCompileOrReview,FSAudited,Organization501c,YearFormation,StateLegalDomicile,TotalNbrVolunteers,SalariesEtcPriorYear,TotalLiabilitiesBOY,Activity2,Activity3,ActivityOther,TotalOfOtherProgramServiceExp,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,OtherWebsite,TotalReportableCompFromOrg,TotalOtherCompensation,CompCurrentOfficersDirectors,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesLegal,FeesForServicesProfFundraising,Advertising,InformationTechnology,Occupancy,Travel,ConferencesMeetings,Interest,Insurance,PledgesAndGrantsReceivable,AccountsReceivable,PrepaidExpensesDeferredCharges,AccountsPayableAccruedExpenses,LoansFromOfficersDirectors,MortNotesPyblSecuredInvestProp,OtherLiabilities,FollowSFAS117,UnrestrictedNetAssets,TemporarilyRestrictedNetAssets,PermanentlyRestrictedNetAssets,MethodOfAccountingAccrual,AuditCommittee,FederalGrantAuditRequired,FederalGrantAuditPerformed,TypeOfOrganizationTrust,NetUnrelatedBusinessTxblIncome,BenefitsPaidToMembersPriorYear,TotalReportableCompFrmRltdOrgs,NumberIndividualsGT100K,NumberOfContractorsGT100K,TotalOtherRevenue,BenefitsToMembers,FeesForServicesManagement,FeesForServicesInvstMgmntFees,InvestmentsPubTradedSecurities,OtherAssetsTotal,DeferredRevenue,SavingsAndTempCashInvestments,InventoriesForSaleOrUse,AllAffiliatesIncluded,PoliciesReferenceChapters,WrittenPolicyOrProcedure,GovernmentGra
nts,GrantsToDomesticIndividuals,ForeignGrants,CompDisqualPersons,FeesForServicesLobbying,Royalties,TravelEntrtnmntPublicOfficials,PaymentsToAffiliates,ReceivablesFromDisqualPersons,OtherNotesLoansReceivableNet,InvestmentsOtherSecurities,InvestmentsProgramRelated,IntangibleAssets,AmendedReturn,TotalProfFundrsngExpPriorYear,InfoInScheduleOPartIII,InfoInScheduleOPartXII,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,TypeOfOrganizationOther,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,InfoInScheduleOPartXI,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,InitialReturn,TotalOfOtherProgramServiceRev,GroupExemptionNumber,UnsecuredNotesLoansPayable,NoncashContributions,TotalOfOtherProgramServiceGrnt,GrantsPayable,TaxExemptBondLiabilities,EscrowAccountLiability,OwnWebsite,AddressChange,GrossSalesOfInventory,CostOfGoodsSold,GrossIncomeGaming,GamingDirectExpenses,MethodOfAccountingOther,TotalJointCosts,RelatedOrganizations,InfoInScheduleOPartVII,FederatedCampaigns,InfoInScheduleOPartV,TerminatedReturn,TerminationOrContraction,ActivityCode,Organization4947a1,SpecialConditionDescription,CountryLegalDomicile,InfoInScheduleOPartIX,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,InfoInScheduleOPartVIII,InfoInScheduleOPartX,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOY
Amt,InfoInScheduleOPartIIIInd,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,TrnsfrExmptNonChrtblRltdOrgInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,InformationTechnologyGrp,ConferencesMeetingsGrp,InsuranceGrp,OtherExpensesGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLess
ExpensesAmt,TotalLiabilitiesBOYAmt,ExpenseAmt,GrantAmt,RevenueAmt,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcGrantAmt,TotalProgramServiceExpensesAmt,InfoInScheduleOPartVIInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,GrantsToDomesticIndividualsGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,AdvertisingGrp,TravelGrp,InterestGrp,DepreciationDepletionGrp,SavingsAndTempCashInvstGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,TemporarilyRstrNetAssetsGrp,InfoInScheduleOPartXIInd,NetUnrlzdGainsLossesInvstAmt,InfoInScheduleOPartXIIInd,AuditCommitteeInd,AllAffiliatesIncludedInd,GrantsToDomesticOrgsGrp,ForeignGrantsGrp,BenefitsToMembersGrp,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,RoyaltiesGrp,OccupancyGrp,PymtTravelEntrtnmntPubOfclGrp,PaymentsToAffiliatesGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsPubTradedSecGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,TotalOtherProgSrvcExpenseAmt,InfoInScheduleOPartVInd,MethodOfAccountingAccrualInd,NoncashContributionsAmt,GrantsPayableGrp,PermanentlyRstrNetAssetsGrp,TaxExemptBondLia
bilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,TotalOtherProgSrvcRevenueAmt,OwnWebsiteInd,TotalJointCostsGrp,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,InfoInScheduleOPartIXInd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,InfoInScheduleOPartXInd,GroupExemptionNum,InfoInScheduleOPartVIIInd,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,InfoInScheduleOPartVIIIInd,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd,Timestamp,TaxPeriodEndDate,TaxPeriodBeginDate,Filer,Officer,TaxYear,BuildTS,ReturnTs,TaxPeriodEndDt,TaxPeriodBeginDt,BusinessOfficerGrp,TaxYr,SigningOfficerGrp,AdditionalFilerInformation,fiscal_year
0,NEW ALBANY WALKING CLUB INC,https://s3.amazonaws.com/irs-form-990/201121029349300402_public.xml,93493102004021,201012,203840246,PHILIP HEIT,299757,False,X,NEWALBANYWALKINGCLUB.COM,X,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,7,7,0,0,105000,172910,134744,126722,131,125,2094,0,241969,299757,110600,114700,0,0,0,6717,104478,122499,215078,237199,26891,62558,105105,167663,0,105105,167663,A CLUB DEDICATED TO PROMOTING WALKING FOR HEALTH AND COMPETITION.,False,False,228055,114700,126847,"TO PROMOTE HEALTH AND PREVENT DISEASE, ENCOURAGE A SUPPORTIVE ENVIRONMENT, AND ELEVATE THE STATUS OF WALKING AND ITS BENEFITS TO INDIVIDUALS AND THE CENTRAL OHIO COMMUNITY.",228055,False,False,False,False,False,False,False,False,False,False,False,False,False,0,0,False,X,7,7,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,OH,X,X,False,False,False,910,172000,172910,126722,"{'TotalRevenueColumn': '299757', 'RelatedOrExemptFunctionIncome': '126847'}","{'Total': '114700', 'ProgramServices': '114700'}","{'Total': '1000', 'ManagementAndGeneral': '1000'}","{'Total': '75', 'ManagementAndGeneral': '75'}","{'Total': '1352', 'ManagementAndGeneral': '1352'}","{'Total': '1213', 'ProgramServices': '1213'}","[{'Description': 'RACE CLOTHING', 'Total': '37423', 'ProgramServices': '37423'}, {'Description': 'RACE MANAGEMENT FEE', 'Total': '30069', 'ProgramServices': '30069'}, {'Description': 'RACE EXPENSES', 'Total': '17372', 'ProgramServices': '17372'},...","{'Total': '19955', 'ProgramServices': '19955'}","{'Total': '237199', 'ProgramServices': '228055', 'ManagementAndGeneral': '2427', 'Fundraising': '6717'}","{'BOY': '104568', 'EOY': '167341'}",2883,2561,"{'BOY': '537', 'EOY': '322'}","{'BOY': '105105', 'EOY': '167663'}",X,"{'BOY': '105105', 'EOY': 
'167663'}",62558,X,False,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011-04-12T08:58:08-05:00,2010-12-31,2010-01-01,"{'EIN': '203840246', 'Name': {'BusinessNameLine1': 'NEW ALBANY WALKING CLUB INC'}, 'NameControl': 'NEWA', 'USAddress': {'AddressLine1': '4000 BAUGHMAN GRANT', 'City': 'NEW ALBANY', 'State': 'OH', 'ZIPCode': '43054'}}","{'Name': 'PHILIP HEIT', 'Title': 'PRESIDENT', 'DateSigned': '2011-03-22', 'AuthorizeThirdParty': 'true'}",2010,2016-02-24 21:20:13Z,,,,,,,,2010


### Initial Verifications - Check whether ``df`` contains all relevant columns
First, we'll create a Python *list* that contains the ID columns we added to the top line of our ``cursor`` earlier.

In [65]:
id_cols = ['DLN', 'EIN', 'URL', 'OrganizationName']

<br>Here we take advantage of Python's 'set' capabilities to compare the columns in our dataset to the columns we are expecting, which are represented by the *mongo_cols* and *id_cols* lists we have created. 

The first line below uses the ``len`` function to tell us how many columns in our dataframe are not in *mongo_cols* or *id_cols*. The answer is 3. And the second line shows us what those columns are.

In [71]:
# Count the columns in df that appear in neither mongo_cols nor id_cols
print(len(set(df.columns.tolist()) - set(mongo_cols) - set(id_cols)))
set(df.columns.tolist()) - set(mongo_cols) - set(id_cols)

3


{'AdditionalFilerInformation', 'SigningOfficerGrp', 'fiscal_year'}
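
<br>A brief aside on set arithmetic: subtracting sets and subtracting set *sizes* give the same number only when every expected column is actually present in the dataframe and the expected lists do not overlap. The sketch below uses made-up column lists (`df_cols`, `expected_cols`, and `key_cols` are hypothetical stand-ins for ``df.columns``, *mongo_cols*, and *id_cols*) to show the two results diverging.

```python
# Toy lists standing in for df.columns, mongo_cols and id_cols (names invented)
df_cols = ['DLN', 'EIN', 'TotalRevenue', 'fiscal_year']
expected_cols = ['TotalRevenue', 'TotalAssets']  # 'TotalAssets' is absent from df_cols
key_cols = ['DLN', 'EIN']

# Columns present in the dataframe but not expected
extra = set(df_cols) - set(expected_cols) - set(key_cols)

# Columns expected but absent from the dataframe
missing = set(expected_cols) - set(df_cols)

# Difference of sizes, which ignores *which* columns are involved
size_diff = len(set(df_cols)) - len(set(expected_cols)) - len(set(key_cols))

print(extra)
print(missing)
print(size_diff)
```

Here one column is extra and one is missing, yet the difference of sizes nets out to zero, which is why checking both directions of the set difference is the safer verification.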

<br>Let's see two sample rows for these three columns. 

In [72]:
df[['AdditionalFilerInformation', 'SigningOfficerGrp', 'fiscal_year']].sample(2)

Unnamed: 0,AdditionalFilerInformation,SigningOfficerGrp,fiscal_year
2054926,,"{'PersonFullName': {'PersonFirstNm': 'SUSAN', 'PersonLastNm': 'JOHNSON'}, 'SSN': '999009999'}",2020
1254788,,,2017


<br>I don't want to use the first two of these variables, so let's drop them from our dataframe.

In [75]:
# Drop both unneeded columns in a single call
df = df.drop(columns=['AdditionalFilerInformation', 'SigningOfficerGrp'])

<br>Check whether any columns in *mongo_cols* are missing from our dataframe. There are none, so we have extracted all variables we expected to grab.

In [79]:
set(mongo_cols) - set(df.columns.tolist())

set()

<br>Show number of observations

In [82]:
len(df)

2104435

<br>Show number of columns

In [83]:
len(df.columns)

485

### Save DF

In [84]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('all filings August 2022 - all control variables.pkl.gz', compression='gzip')

Current date and time :  2022-08-03 13:48:38 

Wall time: 39min 34s
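
<br>Since writing the file takes a long time, it may be worth confirming that a gzip-compressed pickle reads back intact. The sketch below does the round trip on a tiny toy dataframe in a temporary directory; the file name `toy.pkl.gz` and the `toy` dataframe are hypothetical, and the same ``pd.read_pickle`` call is what a later notebook would presumably use to load the real file.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the full dataframe (values copied from the sample row)
toy = pd.DataFrame({'EIN': ['203840246'], 'fiscal_year': ['2010']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl.gz')  # hypothetical file name
    # Write with gzip compression, as in the save step above
    toy.to_pickle(path, compression='gzip')
    # Read it back with the matching compression setting
    restored = pd.read_pickle(path, compression='gzip')

# True when values, dtypes and index all match
print(restored.equals(toy))
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.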
