## Former name:

- *IRS Form 990 e-File Data (3) -- Extract All Variables FROM DOWNLOADED XML FILINGS.ipynb*

### Note 4/14/2025: This needs a lot of cleaning

Future runs: Delete the `EIN` that comes from the initial download -- or better yet, don't read it in in the first place (see code and notes at end of notebook)

# Differences from AWS Version
- There is no *TaxPeriod* variable
    - instead, there is *TaxPeriodEndDt* (which was in AWS filings as well)
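
If a later step still expects a *TaxPeriod*-style value, one option is to derive it from *TaxPeriodEndDt*. The sketch below is only an illustration: it assumes *TaxPeriod* was a `YYYYMM` string (an assumption, not confirmed in this notebook), and the helper name `tax_period_from_end_date` is hypothetical.

```python
from datetime import date


def tax_period_from_end_date(end_dt):
    """Return a YYYYMM string for an ISO date such as '2020-06-30'.

    Assumption: the old TaxPeriod variable used the YYYYMM layout.
    """
    return date.fromisoformat(end_dt).strftime("%Y%m")


# TaxPeriodEndDt value taken from the sample filing shown later in this notebook.
tax_period = tax_period_from_end_date("2020-06-30")
print(tax_period)
```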
    
https://www.irs.gov/charities-non-profits/form-990-series-downloads

Also note a lot of columns from this are missing:
``set(mongo_cols) - set(df.columns.tolist())``
- It might not be a problem though -- because while there is not, for example, *AddressChange*, there is *AddressChangeInd*.
- Also, see the bottom of notebook (5) in this series; it looks like we have all the variables. 
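
A quick way to see whether a missing column is a real gap or just a renamed variable is to look for its counterpart. The sketch below uses short, hypothetical stand-ins for `mongo_cols` and `df.columns`, and it assumes the newer name is simply the old name plus `Ind`, which holds for pairs such as *AddressChange* / *AddressChangeInd* but not for every variable.

```python
# Hypothetical stand-ins for the notebook's mongo_cols and df.columns.
mongo_cols = ["AddressChange", "AddressChangeInd", "AuditCommittee",
              "AuditCommitteeInd", "TaxPeriod"]
df_columns = ["AddressChangeInd", "AuditCommitteeInd", "TaxPeriodEndDt"]

# Same check as in the note above: requested names that never came back.
missing = sorted(set(mongo_cols) - set(df_columns))

# Assumption: the newer-schema counterpart is the old name plus "Ind".
covered = [name for name in missing if name + "Ind" in df_columns]
uncovered = [name for name in missing if name + "Ind" not in df_columns]

print("missing:  ", missing)
print("covered:  ", covered)
print("uncovered:", uncovered)
```

Anything left in `uncovered` would need to be checked by hand against the concordance file.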

# Overview

This is the third in a series of tutorials that illustrate how to download, extract, and parse the IRS 990 e-file data available at https://aws.amazon.com/public-data-sets/irs-990/

In the previous notebook we downloaded all 990 filings into a MongoDB database. The goal of this notebook is to extract the JSON data into a Python PANDAS dataset, which will be our dataset of choice for all future analyses. 

The 990 e-file data contains myriad variables, each of which has to be verified before extracting and analyzing. Working with Jesse Lecy at Arizona State and others, a group of us has come up with a "concordance" file containing the *xpath* of all verified variables. Among other things, this concordance file maps the specific lines from the Form 990 to the xpaths in the XML file. Accordingly, our first step will be to read in the concordance file that has **_all_** reconciled and verified variables to date:
- The file is called *concordance_VERIFIED.xlsx*

I then connect with the *MongoDB* database and import all verified variables into a PANDAS dataframe. 

I then also 'flatten' the *ReturnHeader* column (and delete unneeded *ReturnHeader* columns).

Lastly, I save the following file:
- *all filings August 2022 - all control variables.pkl.gz*

In following notebooks I will combine and rename columns, binarize variables, etc. 
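
To make the 'flatten' step concrete, here is a minimal standard-library sketch of what flattening a nested *ReturnHeader* means. It is not necessarily the code used later in this notebook; the function name `flatten` and the underscore separator are my own choices for illustration. The sample values come from the filing displayed further down.

```python
def flatten(nested, parent_key="", sep="_"):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in nested.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, carrying the key path along.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# A trimmed-down ReturnHeader based on the sample filing in this notebook.
return_header = {
    "ReturnTs": "2020-12-15T08:16:09-06:00",
    "TaxPeriodEndDt": "2020-06-30",
    "ReturnTypeCd": "990",
    "Filer": {
        "EIN": "141901993",
        "BusinessName": {"BusinessNameLine1Txt": "VERMONT FEDERATION OF NURSES &"},
    },
}

flat_header = flatten(return_header)
for name, value in flat_header.items():
    print(name, "=", value)
```

Each nested key becomes its own column-ready name (for example `Filer_EIN`), which is what makes it possible to drop the unneeded *ReturnHeader* pieces afterwards.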

# Load Packages and Connect to MongoDB

First, we will add a datestamp and then import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations. It is invaluable for analyzing datasets. 

In [1]:
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

Current date and time :  2025-04-14 22:46:27


In [2]:
import sys
import time
import json

In [3]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
We can check which version of various packages we're using. You can see I'm running PANDAS 2.2.2 here.

In [8]:
print(pd.__version__)

2.2.2


<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [5]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [6]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


#### MongoDB
Depending on the project, I will store the data in SQLite or MongoDB. With this project I'm using MongoDB -- it's great for storing JSON data where each observation could have different variables. Before we get to the interesting part, the following code blocks set up the MongoDB environment and the database we'll be using. 

**_Note:_** In a terminal you'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB.

In [7]:
import pymongo
from pymongo import MongoClient
client = MongoClient()

In [8]:
print(pymongo.__version__)

4.3.3


<br>Get a list of all databases

In [9]:
client.list_database_names()

['ICIJ',
 'OWS',
 'SMC',
 'admin',
 'cashtags',
 'config',
 'enron',
 'irs_990_db',
 'irs_990_db_v2',
 'local',
 'paradisepapers',
 'sec',
 'sp1500',
 'sp500']

##### Let's define the database and collection/table we created in the previous notebook for storing the 990 filings.

In [10]:
# DEFINE MY mongoDB DATABASE
db = client['irs_990_db']

# DEFINE MY COLLECTION HOUSING 990 DATA
filings_990 = db['filings_990']

<br>When we set up our database in an earlier tutorial, we set a unique constraint on the collection based on *URL*. This prevents duplicate filings from being inserted. Uncomment the following line if the index has not yet been created.

In [11]:
#db.filings_990.create_index([('URL', pymongo.ASCENDING)], unique=True)

<br>Show the index. We can see that as expected *URL* is an index item.

In [12]:
list(db.filings_990.index_information())

['_id_', 'URL_1']

<br>Check how many observations are in the database collection.

In [13]:
filings_990.estimated_document_count()

3469008

<br>Show one filing in the database. You can see here the data is in JSON format. In this notebook we will be converting these filings to a typical 'flat' (one variable per column) database.

In [13]:
db.filings_990.find_one({'URL' : "https://s3.amazonaws.com/irs-form-990/202013509349300506_public.xml" })

{'_id': ObjectId('6437298d230f99d484959c69'),
 '@xmlns': 'http://www.irs.gov/efile',
 '@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 '@xsi:schemaLocation': 'http://www.irs.gov/efile',
 '@returnVersion': '2019v5.1',
 'ReturnHeader': {'@binaryAttachmentCnt': '0',
  'ReturnTs': '2020-12-15T08:16:09-06:00',
  'TaxPeriodEndDt': '2020-06-30',
  'PreparerFirmGrp': {'PreparerFirmEIN': '030340114',
   'PreparerFirmName': {'BusinessNameLine1Txt': 'MUDGETT JENNETT & KROGH-WISNER PC'},
   'PreparerUSAddress': {'AddressLine1Txt': 'PO BOX 937',
    'CityNm': 'MONTPELIER',
    'StateAbbreviationCd': 'VT',
    'ZIPCd': '056010937'}},
  'ReturnTypeCd': '990',
  'TaxPeriodBeginDt': '2019-07-01',
  'Filer': {'EIN': '141901993',
   'BusinessName': {'BusinessNameLine1Txt': 'VERMONT FEDERATION OF NURSES &',
    'BusinessNameLine2Txt': 'HEALTHCARE PROFESSIONALS AFT INC'},
   'BusinessNameControlTxt': 'VERM',
   'PhoneNum': '8026574040',
   'USAddress': {'AddressLine1Txt': '121 PARK AVENUE NO 10'

<br>We can also just show the 'keys', or variable names, for one filing. You can see the huge number of variables available.

In [14]:
for f in filings_990.find({})[:1]:
    print(sorted(f.keys()))

['@documentCount', '@documentId', '@referenceDocumentId', '@returnVersion', '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', 'AccountantCompileOrReview', 'AccountsPayableAccruedExpenses', 'AccountsReceivable', 'ActivitiesConductedPartnership', 'ActivityOrMissionDescription', 'AddressChange', 'AddressPrincipalOfficerUS', 'AllOtherContributions', 'AllOtherExpenses', 'AnnualDisclosureCoveredPersons', 'AuditCommittee', 'BenefitsPaidToMembersCY', 'BenefitsPaidToMembersPriorYear', 'BsnssRltnshpThruFamilyMember', 'BsnssRltnshpWithOrganization', 'ChangesToOrganizingDocs', 'CollectionsOfArt', 'CompensationFromOtherSources', 'CompensationProcessCEO', 'CompensationProcessOther', 'ComplianceWithBackupWitholding', 'ConflictOfInterestPolicy', 'ConservationEasements', 'ConsolidatedAuditFinancialStmt', 'ContributionsGrantsCurrentYear', 'ContributionsGrantsPriorYear', 'CreditCounseling', 'DLN', 'DecisionsSubjectToApproval', 'DeductibleContributionsOfArt', 'DeductibleNonCashContributions', 'DelegationOfMa

# Read in Concordance File
We are going to read in a 'concordance' file. In this notebook we are interested in the *xpaths* for these variables -- in general, each 990 variable will have two different *xpaths* that vary according to year. These *xpaths* allow us to identify the location of the variables in each filing. In a following notebook, we will be using the *new_variable_name* field as our variable name. There are other relevant columns in the concordance file, which we'll cover in subsequent notebooks. 

In [7]:
import pandas as ppd  # <-- temp alias just for this read

concordance = ppd.read_excel("concordance_VERIFIED.xlsx")
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 17
# of observations: 574


Unnamed: 0,xpath,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key,cardinality
0,/Return/ReturnData/IRS990/SpecialConditionDesc,F9_00_HD_SPECIAL_CONDITION_DESC,,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDesc,,,
1,/Return/ReturnData/IRS990/SpecialConditionDescription,F9_00_HD_SPECIAL_CONDITION_DESC,31.0,,,,Special condition description,F990-PC-PART-00,PART-00,TextType,string,Do not fill null,,SpecialConditionDescription,,,


<br>Check *MongoDB_Name*. This name was taken from the *xpath* column.

In [9]:
print(len(concordance['MongoDB_Name'].tolist()))
print(len(set(concordance['MongoDB_Name'].tolist())))

574
480
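
The sketch below illustrates how the concordance ties things together, using the two rows from the preview above. It assumes that *MongoDB_Name* is the last segment of the *xpath* (which matches both rows shown) and groups the names by *variable_name_new*; the variable `names_by_variable` is hypothetical and only for illustration.

```python
from collections import defaultdict

# (xpath, variable_name_new) pairs copied from the concordance preview above.
rows = [
    ("/Return/ReturnData/IRS990/SpecialConditionDesc",
     "F9_00_HD_SPECIAL_CONDITION_DESC"),
    ("/Return/ReturnData/IRS990/SpecialConditionDescription",
     "F9_00_HD_SPECIAL_CONDITION_DESC"),
]

names_by_variable = defaultdict(list)
for xpath, variable_name_new in rows:
    # Assumption: the MongoDB field name is the final xpath segment.
    mongo_name = xpath.rsplit("/", 1)[-1]
    names_by_variable[variable_name_new].append(mongo_name)

for variable, mongo_names in names_by_variable.items():
    print(variable, "<-", mongo_names)
```

This is the pattern we rely on in later notebooks: one final variable name, fed by two (or more) differently named source fields depending on the filing year.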


<br>Create a list, ``mongo_cols``, of all variable names in the concordance file. The 574 rows contain 480 unique names. These are our target variables -- the ones we'll extract for each filing from MongoDB.

In [10]:
mongo_cols = concordance[:]['MongoDB_Name'].tolist()
print(len(mongo_cols))
print(len(set(mongo_cols)))
mongo_cols = list(set(mongo_cols))
print(len(mongo_cols))
print(mongo_cols[:5])

574
480
480
['ActivitiesConductedPrtshpInd', 'NetAssetsOrFundBalancesBOYAmt', 'VotingMembersGoverningBodyCnt', 'AnnualDisclosureCoveredPersons', 'ActivitiesConductedPartnership']


# Extract Data from MongoDB Database

Clean up and sort our desired columns. Here we use a 'list comprehension' to remove ``nan`` values from the list we just created; the count stays at 480, so there were none to remove. 

In [11]:
mongo_cols = [x for x in mongo_cols if str(x) != 'nan']
print(len(mongo_cols))

480
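
The de-duplication and ``nan`` removal can also be done in one step. A minimal sketch with a short, made-up list (the real list comes from the concordance file); sorting at the end gives a reproducible order, since the order of a Python `set` is arbitrary.

```python
# Hypothetical sample: one duplicate name and one missing value.
raw_names = ["TotalAssets", float("nan"), "AddressChange",
             "TotalAssets", "AddressChangeInd"]

# Same test as above (str(x) != 'nan'), applied while building a set,
# then sorted so the result does not depend on set ordering.
names = sorted({x for x in raw_names if str(x) != "nan"})

print(len(raw_names), "->", len(names))
print(names)
```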


In [12]:
print(len(sorted(mongo_cols)))

480


<br>Use 'helper' loop to print out variables for MongoDB -- we'll copy and paste this into a subsequent block of code.

In [20]:
for c in sorted(mongo_cols):
    print("    '"+c+"'"+': 1, ')

    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'AccountsPayableAccrExpnssGrp': 1, 
    'AccountsPayableAccruedExpenses': 1, 
    'AccountsReceivable': 1, 
    'AccountsReceivableGrp': 1, 
    'ActivitiesConductedPartnership': 1, 
    'ActivitiesConductedPrtshpInd': 1, 
    'Activity2': 1, 
    'Activity3': 1, 
    'ActivityCd': 1, 
    'ActivityCode': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'ActivityOther': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'Advertising': 1, 
    'AdvertisingGrp': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AllOtherExpenses': 1, 
    'AllOtherExpensesGrp': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidTo

<br>The output of the 'helper' loop above is copied and pasted into the ``find()`` call below, and the result is stored in a variable called ``cursor``. Note that in the first row we also include five identifier columns (*EIN*, *OrganizationName*, *DLN*, *URL*, and *ReturnHeader*). We also include *_id* (a MongoDB column) with a '0' tag, meaning we don't want this otherwise automatically included column.

In [32]:
cursor = filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,  'ReturnHeader': 1,
    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'AccountsPayableAccrExpnssGrp': 1, 
    'AccountsPayableAccruedExpenses': 1, 
    'AccountsReceivable': 1, 
    'AccountsReceivableGrp': 1, 
    'ActivitiesConductedPartnership': 1, 
    'ActivitiesConductedPrtshpInd': 1, 
    'Activity2': 1, 
    'Activity3': 1, 
    'ActivityCd': 1, 
    'ActivityCode': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'ActivityOther': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'Advertising': 1, 
    'AdvertisingGrp': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AllOtherExpenses': 1, 
    'AllOtherExpensesGrp': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidToMembersCY': 1, 
    'BenefitsPaidToMembersPriorYear': 1, 
    'BenefitsToMembers': 1, 
    'BenefitsToMembersGrp': 1, 
    'BuildTS': 1, 
    'BusinessOfficerGrp': 1, 
    'CYBenefitsPaidToMembersAmt': 1, 
    'CYContributionsGrantsAmt': 1, 
    'CYGrantsAndSimilarPaidAmt': 1, 
    'CYInvestmentIncomeAmt': 1, 
    'CYOtherExpensesAmt': 1, 
    'CYOtherRevenueAmt': 1, 
    'CYProgramServiceRevenueAmt': 1, 
    'CYRevenuesLessExpensesAmt': 1, 
    'CYSalariesCompEmpBnftPaidAmt': 1, 
    'CYTotalExpensesAmt': 1, 
    'CYTotalFundraisingExpenseAmt': 1, 
    'CYTotalProfFndrsngExpnsAmt': 1, 
    'CYTotalRevenueAmt': 1, 
    'CashNonInterestBearing': 1, 
    'CashNonInterestBearingGrp': 1, 
    'ChangeToOrgDocumentsInd': 1, 
    'ChangesToOrganizingDocs': 1, 
    'CntrbtnsRprtdFundraisingEvents': 1, 
    'CntrctRcvdGreaterThan100KCnt': 1, 
    'CompCurrentOfcrDirectorsGrp': 1, 
    'CompCurrentOfficersDirectors': 1, 
    'CompDisqualPersons': 1, 
    'CompDisqualPersonsGrp': 1, 
    'CompensationFromOtherSources': 1, 
    'CompensationFromOtherSrcsInd': 1, 
    'CompensationProcessCEO': 1, 
    'CompensationProcessCEOInd': 1, 
    'CompensationProcessOther': 1, 
    'CompensationProcessOtherInd': 1, 
    'ConferencesMeetings': 1, 
    'ConferencesMeetingsGrp': 1, 
    'ConflictOfInterestPolicy': 1, 
    'ConflictOfInterestPolicyInd': 1, 
    'ContractTerminationInd': 1, 
    'ContriRptFundraisingEventAmt': 1, 
    'ContributionsGrantsCurrentYear': 1, 
    'ContributionsGrantsPriorYear': 1, 
    'CostOfGoodsSold': 1, 
    'CostOfGoodsSoldAmt': 1, 
    'CountryLegalDomicile': 1, 
    'DecisionsSubjectToApprovaInd': 1, 
    'DecisionsSubjectToApproval': 1, 
    'DeferredRevenue': 1, 
    'DeferredRevenueGrp': 1, 
    'DelegationOfManagementDuties': 1, 
    'DelegationOfMgmtDutiesInd': 1, 
    'DepreciationDepletion': 1, 
    'DepreciationDepletionGrp': 1, 
    'Desc': 1, 
    'Description': 1, 
    'DisregardedEntity': 1, 
    'DisregardedEntityInd': 1, 
    'DoNotFollowSFAS117': 1, 
    'DocumentRetentionPolicy': 1, 
    'DocumentRetentionPolicyInd': 1, 
    'DonatedServicesAndUseFcltsAmt': 1, 
    'ElectionOfBoardMembers': 1, 
    'ElectionOfBoardMembersInd': 1, 
    'EmployeeCnt': 1, 
    'EngagedInExcessBenefitTransInd': 1, 
    'EscrowAccountLiability': 1, 
    'EscrowAccountLiabilityGrp': 1, 
    'ExcessBenefitTransaction': 1, 
    'Expense': 1, 
    'ExpenseAmt': 1, 
    'FSAudited': 1, 
    'FSAuditedInd': 1, 
    'FamilyOrBusinessRelationship': 1, 
    'FamilyOrBusinessRlnInd': 1, 
    'FederalGrantAuditPerformed': 1, 
    'FederalGrantAuditPerformedInd': 1, 
    'FederalGrantAuditRequired': 1, 
    'FederalGrantAuditRequiredInd': 1, 
    'FederatedCampaigns': 1, 
    'FederatedCampaignsAmt': 1, 
    'FeesForServicesAccounting': 1, 
    'FeesForServicesAccountingGrp': 1, 
    'FeesForServicesInvstMgmntFees': 1, 
    'FeesForServicesLegal': 1, 
    'FeesForServicesLegalGrp': 1, 
    'FeesForServicesLobbying': 1, 
    'FeesForServicesLobbyingGrp': 1, 
    'FeesForServicesManagement': 1, 
    'FeesForServicesManagementGrp': 1, 
    'FeesForServicesOther': 1, 
    'FeesForServicesOtherGrp': 1, 
    'FeesForServicesProfFundraising': 1, 
    'FeesForSrvcInvstMgmntFeesGrp': 1, 
    'Filer': 1, 
    'FinalReturnInd': 1, 
    'FollowSFAS117': 1, 
    'ForeignGrants': 1, 
    'ForeignGrantsGrp': 1, 
    'Form990ProvidedToGoverningBody': 1, 
    'Form990ProvidedToGvrnBodyInd': 1, 
    'FormationYr': 1, 
    'FormerOfcrEmployeesListedInd': 1, 
    'FormersListed': 1, 
    'FundraisingActivities': 1, 
    'FundraisingActivitiesInd': 1, 
    'FundraisingAmt': 1, 
    'FundraisingDirectExpenses': 1, 
    'FundraisingDirectExpensesAmt': 1, 
    'FundraisingEvents': 1, 
    'FundraisingGrossIncomeAmt': 1, 
    'Gaming': 1, 
    'GamingActivitiesInd': 1, 
    'GamingDirectExpenses': 1, 
    'GamingDirectExpensesAmt': 1, 
    'GamingGrossIncomeAmt': 1, 
    'GoverningBodyVotingMembersCnt': 1, 
    'GovernmentGrants': 1, 
    'GovernmentGrantsAmt': 1, 
    'GrantAmt': 1, 
    'Grants': 1, 
    'GrantsAndSimilarAmntsCY': 1, 
    'GrantsAndSimilarAmntsPriorYear': 1, 
    'GrantsPayable': 1, 
    'GrantsPayableGrp': 1, 
    'GrantsToDomesticIndividuals': 1, 
    'GrantsToDomesticIndividualsGrp': 1, 
    'GrantsToDomesticOrgs': 1, 
    'GrantsToDomesticOrgsGrp': 1, 
    'GrossIncomeFundraisingEvents': 1, 
    'GrossIncomeGaming': 1, 
    'GrossReceipts': 1, 
    'GrossReceiptsAmt': 1, 
    'GrossSalesOfInventory': 1, 
    'GrossSalesOfInventoryAmt': 1, 
    'GroupExemptionNum': 1, 
    'GroupExemptionNumber': 1, 
    'GroupReturnForAffiliates': 1, 
    'GroupReturnForAffiliatesInd': 1, 
    'IRPDocumentCnt': 1, 
    'IndependentVotingMemberCnt': 1, 
    'IndivRcvdGreaterThan100KCnt': 1, 
    'InfoInScheduleOPartIII': 1, 
    'InfoInScheduleOPartIIIInd': 1, 
    'InfoInScheduleOPartIX': 1, 
    'InfoInScheduleOPartIXInd': 1, 
    'InfoInScheduleOPartV': 1, 
    'InfoInScheduleOPartVI': 1, 
    'InfoInScheduleOPartVII': 1, 
    'InfoInScheduleOPartVIII': 1, 
    'InfoInScheduleOPartVIIIInd': 1, 
    'InfoInScheduleOPartVIIInd': 1, 
    'InfoInScheduleOPartVIInd': 1, 
    'InfoInScheduleOPartVInd': 1, 
    'InfoInScheduleOPartX': 1, 
    'InfoInScheduleOPartXI': 1, 
    'InfoInScheduleOPartXII': 1, 
    'InfoInScheduleOPartXIIInd': 1, 
    'InfoInScheduleOPartXIInd': 1, 
    'InfoInScheduleOPartXInd': 1, 
    'InformationTechnology': 1, 
    'InformationTechnologyGrp': 1, 
    'InitialReturn': 1, 
    'InitialReturnInd': 1, 
    'Insurance': 1, 
    'InsuranceGrp': 1, 
    'IntangibleAssets': 1, 
    'IntangibleAssetsGrp': 1, 
    'Interest': 1, 
    'InterestGrp': 1, 
    'InventoriesForSaleOrUse': 1, 
    'InventoriesForSaleOrUseGrp': 1, 
    'InvestmentExpenseAmt': 1, 
    'InvestmentInJointVenture': 1, 
    'InvestmentInJointVentureInd': 1, 
    'InvestmentIncomeCurrentYear': 1, 
    'InvestmentIncomePriorYear': 1, 
    'InvestmentsOtherSecurities': 1, 
    'InvestmentsOtherSecuritiesGrp': 1, 
    'InvestmentsProgramRelated': 1, 
    'InvestmentsProgramRelatedGrp': 1, 
    'InvestmentsPubTradedSecGrp': 1, 
    'InvestmentsPubTradedSecurities': 1, 
    'LandBldgEquipAccumDeprecAmt': 1, 
    'LandBldgEquipBasisNetGrp': 1, 
    'LandBldgEquipCostOrOtherBssAmt': 1, 
    'LandBldgEquipmentAccumDeprec': 1, 
    'LandBuildingsEquipmentBasis': 1, 
    'LandBuildingsEquipmentBasisNet': 1, 
    'LegalDomicileCountryCd': 1, 
    'LegalDomicileStateCd': 1, 
    'LoansFromOfficersDirectors': 1, 
    'LoansFromOfficersDirectorsGrp': 1, 
    'LobbyingActivities': 1, 
    'LobbyingActivitiesInd': 1, 
    'LocalChapters': 1, 
    'LocalChaptersInd': 1, 
    'MaterialDiversionOrMisuse': 1, 
    'MaterialDiversionOrMisuseInd': 1, 
    'MembersOrStockholders': 1, 
    'MembersOrStockholdersInd': 1, 
    'MembershipDues': 1, 
    'MembershipDuesAmt': 1, 
    'MethodOfAccountingAccrual': 1, 
    'MethodOfAccountingAccrualInd': 1, 
    'MethodOfAccountingCash': 1, 
    'MethodOfAccountingCashInd': 1, 
    'MethodOfAccountingOther': 1, 
    'MethodOfAccountingOtherInd': 1, 
    'MinutesOfCommittees': 1, 
    'MinutesOfCommitteesInd': 1, 
    'MinutesOfGoverningBody': 1, 
    'MinutesOfGoverningBodyInd': 1, 
    'MissionDesc': 1, 
    'MissionDescription': 1, 
    'MortNotesPyblSecuredInvestProp': 1, 
    'MortgNotesPyblScrdInvstPropGrp': 1, 
    'NameOfPrincipalOfficerPerson': 1, 
    'NbrIndependentVotingMembers': 1, 
    'NbrVotingGoverningBodyMembers': 1, 
    'NbrVotingMembersGoverningBody': 1, 
    'NetAssetsOrFundBalancesBOY': 1, 
    'NetAssetsOrFundBalancesBOYAmt': 1, 
    'NetAssetsOrFundBalancesEOY': 1, 
    'NetAssetsOrFundBalancesEOYAmt': 1, 
    'NetUnrelatedBusTxblIncmAmt': 1, 
    'NetUnrelatedBusinessTxblIncome': 1, 
    'NetUnrlzdGainsLossesInvstAmt': 1, 
    'NoListedPersonsCompensated': 1, 
    'NoListedPersonsCompensatedInd': 1, 
    'NoncashContributions': 1, 
    'NoncashContributionsAmt': 1, 
    'NumberFormsTransmittedWith1096': 1, 
    'NumberIndependentVotingMembers': 1, 
    'NumberIndividualsGT100K': 1, 
    'NumberOfContractorsGT100K': 1, 
    'NumberOfEmployees': 1, 
    'Occupancy': 1, 
    'OccupancyGrp': 1, 
    'OfficeExpenses': 1, 
    'OfficeExpensesGrp': 1, 
    'Officer': 1, 
    'OfficerMailingAddress': 1, 
    'OfficerMailingAddressInd': 1, 
    'OrgDoesNotFollowSFAS117Ind': 1, 
    'Organization4947a1': 1, 
    'Organization4947a1NotPFInd': 1, 
    'Organization501c': 1, 
    'Organization501c3': 1, 
    'Organization501c3Ind': 1, 
    'Organization501cInd': 1, 
    'OrganizationFollowsSFAS117Ind': 1, 
    'OthNotesLoansReceivableNetGrp': 1, 
    'OtherAssetsTotal': 1, 
    'OtherAssetsTotalGrp': 1, 
    'OtherEmployeeBenefits': 1, 
    'OtherEmployeeBenefitsGrp': 1, 
    'OtherExpensePriorYear': 1, 
    'OtherExpenses': 1, 
    'OtherExpensesCurrentYear': 1, 
    'OtherExpensesGrp': 1, 
    'OtherLiabilities': 1, 
    'OtherLiabilitiesGrp': 1, 
    'OtherNotesLoansReceivableNet': 1, 
    'OtherOrganizationDsc': 1, 
    'OtherRevenueCurrentYear': 1, 
    'OtherRevenuePriorYear': 1, 
    'OtherRevenueTotalAmt': 1, 
    'OtherSalariesAndWages': 1, 
    'OtherSalariesAndWagesGrp': 1, 
    'OtherWebsite': 1, 
    'OtherWebsiteInd': 1, 
    'OwnWebsite': 1, 
    'OwnWebsiteInd': 1, 
    'PYBenefitsPaidToMembersAmt': 1, 
    'PYContributionsGrantsAmt': 1, 
    'PYExcessBenefitTransInd': 1, 
    'PYGrantsAndSimilarPaidAmt': 1, 
    'PYInvestmentIncomeAmt': 1, 
    'PYOtherExpensesAmt': 1, 
    'PYOtherRevenueAmt': 1, 
    'PYProgramServiceRevenueAmt': 1, 
    'PYRevenuesLessExpensesAmt': 1, 
    'PYSalariesCompEmpBnftPaidAmt': 1, 
    'PYTotalExpensesAmt': 1, 
    'PYTotalProfFndrsngExpnsAmt': 1, 
    'PYTotalRevenueAmt': 1, 
    'PaymentsToAffiliates': 1, 
    'PaymentsToAffiliatesGrp': 1, 
    'PayrollTaxes': 1, 
    'PayrollTaxesGrp': 1, 
    'PensionPlanContributions': 1, 
    'PensionPlanContributionsGrp': 1, 
    'PermanentlyRestrictedNetAssets': 1, 
    'PermanentlyRstrNetAssetsGrp': 1, 
    'PledgesAndGrantsReceivable': 1, 
    'PledgesAndGrantsReceivableGrp': 1, 
    'PoliciesReferenceChapters': 1, 
    'PoliciesReferenceChaptersInd': 1, 
    'PoliticalActivities': 1, 
    'PoliticalCampaignActyInd': 1, 
    'PrepaidExpensesDeferredCharges': 1, 
    'PrepaidExpensesDefrdChargesGrp': 1, 
    'PrincipalOfficerNm': 1, 
    'PriorExcessBenefitTransaction': 1, 
    'PriorPeriodAdjustmentsAmt': 1, 
    'ProfessionalFundraising': 1, 
    'ProfessionalFundraisingInd': 1, 
    'ProgSrvcAccomActy2Grp': 1, 
    'ProgSrvcAccomActy3Grp': 1, 
    'ProgSrvcAccomActyOtherGrp': 1, 
    'ProgramServiceRevenueCY': 1, 
    'ProgramServiceRevenuePriorYear': 1, 
    'PymtTravelEntrtnmntPubOfclGrp': 1, 
    'RcvblFromDisqualifiedPrsnGrp': 1, 
    'ReceivablesFromDisqualPersons': 1, 
    'ReconcilationDonatedServices': 1, 
    'ReconcilationInvestExpenses': 1, 
    'ReconcilationPriorAdjustment': 1, 
    'ReconcilationRevenueExpenses': 1, 
    'ReconcilationRevenueExpnssAmt': 1, 
    'ReconciliationUnrealizedInvest': 1, 
    'RegularMonitoringEnforcement': 1, 
    'RegularMonitoringEnfrcInd': 1, 
    'RelatedEntity': 1, 
    'RelatedEntityInd': 1, 
    'RelatedOrgControlledEntity': 1, 
    'RelatedOrganizationCtrlEntInd': 1, 
    'RelatedOrganizations': 1, 
    'RelatedOrganizationsAmt': 1, 
    'RetainedEarningsEndowmentEtc': 1, 
    'ReturnTs': 1, 
    'Revenue': 1, 
    'RevenueAmt': 1, 
    'RevenuesLessExpensesCY': 1, 
    'RevenuesLessExpensesPriorYear': 1, 
    'Royalties': 1, 
    'RoyaltiesGrp': 1, 
    'RtnEarnEndowmentIncmOthFndsGrp': 1, 
    'SalariesEtcCurrentYear': 1, 
    'SalariesEtcPriorYear': 1, 
    'SavingsAndTempCashInvestments': 1, 
    'SavingsAndTempCashInvstGrp': 1, 
    'SignificantChange': 1, 
    'SignificantChangeInd': 1, 
    'SignificantNewProgramServices': 1, 
    'SignificantNewProgramSrvcInd': 1, 
    'SpecialConditionDesc': 1, 
    'SpecialConditionDescription': 1, 
    'StateLegalDomicile': 1, 
    'StatesWhereCopyOfReturnIsFiled': 1, 
    'StatesWhereCopyOfReturnIsFldCd': 1, 
    'TaxExemptBondLiabilities': 1, 
    'TaxExemptBondLiabilitiesGrp': 1, 
    'TaxPeriod': 1, 
    'TaxPeriodBeginDate': 1, 
    'TaxPeriodBeginDt': 1, 
    'TaxPeriodEndDate': 1, 
    'TaxPeriodEndDt': 1, 
    'TaxYear': 1, 
    'TaxYr': 1, 
    'TemporarilyRestrictedNetAssets': 1, 
    'TemporarilyRstrNetAssetsGrp': 1, 
    'TerminatedReturn': 1, 
    'TerminationOrContraction': 1, 
    'Timestamp': 1, 
    'TotReportableCompRltdOrgAmt': 1, 
    'TotalAssets': 1, 
    'TotalAssetsBOY': 1, 
    'TotalAssetsBOYAmt': 1, 
    'TotalAssetsEOY': 1, 
    'TotalAssetsEOYAmt': 1, 
    'TotalAssetsGrp': 1, 
    'TotalCompGT150K': 1, 
    'TotalCompGreaterThan150KInd': 1, 
    'TotalContributions': 1, 
    'TotalContributionsAmt': 1, 
    'TotalEmployeeCnt': 1, 
    'TotalExpensesCurrentYear': 1, 
    'TotalExpensesPriorYear': 1, 
    'TotalFunctionalExpenses': 1, 
    'TotalFunctionalExpensesGrp': 1, 
    'TotalFundrsngExpCurrentYear': 1, 
    'TotalGrossUBI': 1, 
    'TotalGrossUBIAmt': 1, 
    'TotalJointCosts': 1, 
    'TotalJointCostsGrp': 1, 
    'TotalLiabilitiesBOY': 1, 
    'TotalLiabilitiesBOYAmt': 1, 
    'TotalLiabilitiesEOY': 1, 
    'TotalLiabilitiesEOYAmt': 1, 
    'TotalNbrEmployees': 1, 
    'TotalNbrVolunteers': 1, 
    'TotalOfOtherProgramServiceExp': 1, 
    'TotalOfOtherProgramServiceGrnt': 1, 
    'TotalOfOtherProgramServiceRev': 1, 
    'TotalOtherCompensation': 1, 
    'TotalOtherCompensationAmt': 1, 
    'TotalOtherProgSrvcExpenseAmt': 1, 
    'TotalOtherProgSrvcGrantAmt': 1, 
    'TotalOtherProgSrvcRevenueAmt': 1, 
    'TotalOtherRevenue': 1, 
    'TotalProfFundrsngExpCY': 1, 
    'TotalProfFundrsngExpPriorYear': 1, 
    'TotalProgramServiceExpense': 1, 
    'TotalProgramServiceExpensesAmt': 1, 
    'TotalProgramServiceRevenue': 1, 
    'TotalProgramServiceRevenueAmt': 1, 
    'TotalReportableCompFrmRltdOrgs': 1, 
    'TotalReportableCompFromOrg': 1, 
    'TotalReportableCompFromOrgAmt': 1, 
    'TotalRevenue': 1, 
    'TotalRevenueCurrentYear': 1, 
    'TotalRevenueGrp': 1, 
    'TotalRevenuePriorYear': 1, 
    'TotalVolunteersCnt': 1, 
    'TransactionRelatedEntity': 1, 
    'TransactionWithControlEntInd': 1, 
    'TransfersToExemptNonChrtblOrg': 1, 
    'Travel': 1, 
    'TravelEntrtnmntPublicOfficials': 1, 
    'TravelGrp': 1, 
    'TrnsfrExmptNonChrtblRltdOrgInd': 1, 
    'TypeOfOrgOtherDescription': 1, 
    'TypeOfOrganizationAssocInd': 1, 
    'TypeOfOrganizationAssociation': 1, 
    'TypeOfOrganizationCorpInd': 1, 
    'TypeOfOrganizationCorporation': 1, 
    'TypeOfOrganizationOther': 1, 
    'TypeOfOrganizationOtherInd': 1, 
    'TypeOfOrganizationTrust': 1, 
    'TypeOfOrganizationTrustInd': 1, 
    'UnrelatedBusIncmOverLimitInd': 1, 
    'UnrelatedBusinessIncome': 1, 
    'UnrestrictedNetAssets': 1, 
    'UnrestrictedNetAssetsGrp': 1, 
    'UnsecuredNotesLoansPayable': 1, 
    'UnsecuredNotesLoansPayableGrp': 1, 
    'UponRequest': 1, 
    'UponRequestInd': 1, 
    'VotingMembersGoverningBodyCnt': 1, 
    'VotingMembersIndependentCnt': 1, 
    'WebSite': 1, 
    'WebsiteAddressTxt': 1, 
    'WhistleblowerPolicy': 1, 
    'WhistleblowerPolicyInd': 1, 
    'WrittenPolicyOrProcedure': 1, 
    'WrittenPolicyOrProcedureInd': 1, 
    'YearFormation': 1})
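
As an alternative to copying and pasting, the same projection can be built programmatically. This is only a sketch: `mongo_cols` is shortened to three names here, and `id_cols` and `projection` are names I've introduced for illustration.

```python
# Identifier columns requested in the first row of the find() call above.
id_cols = ["EIN", "OrganizationName", "DLN", "URL", "ReturnHeader"]

# Shortened stand-in for the full 480-name list built earlier.
mongo_cols = ["YearFormation", "AccountantCompileOrReview",
              "AccountantCompileOrReviewInd"]

# '_id': 0 suppresses MongoDB's automatic _id field; every other name gets 1.
projection = {"_id": 0}
projection.update({name: 1 for name in id_cols + sorted(mongo_cols)})

print(projection)

# With a live connection, this dict would take the place of the long literal:
# cursor = filings_990.find({}, projection)
```

Building the dictionary this way keeps the query in sync with the concordance file if variables are added or removed later.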

<br>In the next block we define a function that reads the MongoDB data in batches rather than all at once. This is only necessary for very large datasets. 

In [26]:
"""
def batched(cursor, batch_size):
    batch = []
    for doc in cursor:
        batch.append(doc) #<timed exec>:5: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
        if batch and not len(batch) % batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
"""

### Updated version of above

In [33]:
def batched(cursor, batch_size):
    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
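
To see what `batched` does without touching MongoDB, we can feed it any iterable. In this small sketch `range(7)` stands in for the cursor (an assumption purely for demonstration); the function body is the same as above.

```python
def batched(cursor, batch_size):
    """Yield lists of up to `batch_size` items from any iterable."""
    batch = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly smaller, batch.
        yield batch


# range(7) stands in for the MongoDB cursor.
batches = list(batched(range(7), 3))
print(batches)
```

The last batch is simply shorter when the total is not a multiple of the batch size. On Python 3.12 and later, the standard library's `itertools.batched` does something similar, though it yields tuples rather than lists.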

### Read 990 DB into PANDAS DF
Read verified variables for all filings into a PANDAS dataframe. This will take several hours depending on your machine; the run below took about four and a half hours of wall time.

In [34]:
%%time
batches = []
for batch in batched(cursor, 10000):
    batches.append(pd.DataFrame(batch))

df = pd.concat(batches, ignore_index=True)
print(len(df))
df[:2]

CPU times: total: 14min 48s
Wall time: 4h 29min 22s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,InfoInScheduleOPartIII,MissionDescription,SignificantNewProgramServices,SignificantChange,Expense,Grants,Description,TotalProgramServiceExpense,PoliticalActivities,LobbyingActivities,ProfessionalFundraising,FundraisingActivities,Gaming,ExcessBenefitTransaction,PriorExcessBenefitTransaction,DisregardedEntity,RelatedEntity,RelatedOrgControlledEntity,TransactionRelatedEntity,TransfersToExemptNonChrtblOrg,ActivitiesConductedPartnership,NumberFormsTransmittedWith1096,NumberOfEmployees,UnrelatedBusinessIncome,InfoInScheduleOPartVI,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailing
Address,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,GrantsToDomesticOrgs,GrantsToDomesticIndividuals,FeesForServicesLegal,FeesForServicesAccounting,OfficeExpenses,PaymentsToAffiliates,DepreciationDepletion,OtherExpenses,AllOtherExpenses,TotalFunctionalExpenses,SavingsAndTempCashInvestments,AccountsReceivable,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasisNet,InvestmentsOtherSecurities,TotalAssets,AccountsPayableAccruedExpenses,GrantsPayable,OtherLiabilities,FollowSFAS117,UnrestrictedNetAssets,InfoInScheduleOPartXI,ReconcilationRevenueExpenses,InfoInScheduleOPartXII,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,Revenue,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,ForeignGrants,BenefitsToMembers,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,Advertising,InformationTechnology,Royalties,Occupancy,Travel,TravelEntrtnmntPublicOfficials,ConferencesMeetings,Interest,Insurance,CashNonInterestBearing,PledgesAndGrantsReceivable,ReceivablesFromDisqualPersons,OtherNotesLoansReceivableNet,InventoriesForSaleOrUse,PrepaidExpensesDeferredCharges,InvestmentsPubTradedSecurities,Inve
stmentsProgramRelated,IntangibleAssets,OtherAssetsTotal,DeferredRevenue,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,Activity2,Activity3,InfoInScheduleOPartVII,TaxExemptBondLiabilities,TemporarilyRestrictedNetAssets,OtherWebsite,PermanentlyRestrictedNetAssets,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,InfoInScheduleOPartV,OwnWebsite,UnsecuredNotesLoansPayable,ActivityOther,TotalOfOtherProgramServiceExp,TotalOfOtherProgramServiceRev,EscrowAccountLiability,TotalOfOtherProgramServiceGrnt,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TotalJointCosts,TerminatedReturn,TerminationOrContraction,ActivityCode,SpecialConditionDescription,Organization4947a1,InfoInScheduleOPartIX,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,InfoInScheduleOPartVIII,InfoInScheduleOPartX,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFu
ndBalancesEOYAmt,InfoInScheduleOPartIIIInd,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,TrnsfrExmptNonChrtblRltdOrgInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,InformationTechnologyGrp,ConferencesMeetingsGrp,InsuranceGrp,OtherExpensesGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,P
YRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,ExpenseAmt,GrantAmt,RevenueAmt,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcGrantAmt,TotalProgramServiceExpensesAmt,InfoInScheduleOPartVIInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,GrantsToDomesticIndividualsGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,AdvertisingGrp,TravelGrp,InterestGrp,DepreciationDepletionGrp,SavingsAndTempCashInvstGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,TemporarilyRstrNetAssetsGrp,InfoInScheduleOPartXIInd,NetUnrlzdGainsLossesInvstAmt,InfoInScheduleOPartXIIInd,AuditCommitteeInd,AllAffiliatesIncludedInd,GrantsToDomesticOrgsGrp,ForeignGrantsGrp,BenefitsToMembersGrp,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,RoyaltiesGrp,OccupancyGrp,PymtTravelEntrtnmntPubOfclGrp,PaymentsToAffiliatesGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsPubTradedSecGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,TotalOtherProgSrvcExpenseAmt,InfoInScheduleOPartVInd,MethodOfAccountingAccrualInd,NoncashContributionsAmt,GrantsPayableGrp,PermanentlyRstrNetAssetsGrp,Tax
ExemptBondLiabilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,TotalOtherProgSrvcRevenueAmt,OwnWebsiteInd,TotalJointCostsGrp,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,InfoInScheduleOPartIXInd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,InfoInScheduleOPartXInd,GroupExemptionNum,InfoInScheduleOPartVIIInd,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,InfoInScheduleOPartVIIIInd,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-11-09T06:41:09-06:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': ...",X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0,0,0,1044925,1439340,0,0,30447,33563,0,1000,1075372,1473903,638637,925000,0,0,0,0,0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,X,"THE CORPORATION IS ORGANIZED AND WILL BE OPERATED EXCLUSIVELY FOR CHARITABLE, EDUCATIONAL AND SCIENTIFIC PURPOSES WITHIN THE MEANING OF SECTION 501(C)(3) OF THE INTERNAL REVENUE CODE. SUCH PURPOSES SHALL BE LIMITED TO PROVIDING SUPPORT AND FUNDIN...",0,0,1043744,925000,"RMHC OF THE PHILADELPHIA REGION, INC. GRANTS HUNDREDS OF THOUSANDS OF DOLLARS PER YEAR TO SUPPORT NON-PROFIT PROGRAMS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN. 
LOCALLY, RMHC SUPPORTS THE PHILADELPHIA, SOUTHERN NEW JERSEY AND DE...",1043744,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,X,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0,0,0,0,0,0,0,0,1439340,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '892000', 'ProgramServices': '892000'}","{'Total': '33000', 'ProgramServices': '33000'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '123', 'ManagementAndGeneral': '123'}","{'Total': '118744', 'ProgramServices': '118744'}","{'Total': '86228', 'ManagementAndGeneral': '86228'}","[{'Description': 'FUNDRAISING COSTS', 'Total': '108311', 'Fundraising': '108311'}, {'Description': 'CANISTER COLLECTION FEE', 'Total': '81925', 'Fundraising': '81925'}, {'Description': 'PR/ADMINISTRATIVE SERVI', 'Total': '34517', 'ManagementAndGe...","{'Total': '763', 'ManagementAndGeneral': '763'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}","{'BOY': '103412', 'EOY': '147981'}",256845,86228,"{'BOY': '0', 'EOY': '170617'}","{'BOY': '1489143', 'EOY': '1851561'}","{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '39670', 'EOY': '44353'}","{'BOY': '80500', 'EOY': '166000'}","{'BOY': '51640', 'EOY': '240077'}",X,"{'BOY': '1753405', 'EOY': '1990429'}",X,89152,X,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [36]:
print(len(df))

3469008


#### Save DF
We will save the dataset as a gzip-compressed pickle file (a pandas-native format). This is a very large file, so saving will take some time.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#df.to_pickle('all NEW filings April 2025 - all control variables.pkl.gz', compression='gzip')

Current date and time :  2025-04-10 14:37:40 
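
To reload the saved file in a later session, `pd.read_pickle` can read a gzip-compressed pickle directly. The sketch below is only an illustration: it uses a tiny stand-in DataFrame and a temporary file name (both hypothetical), not the real 3.4-million-row dataset.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the real DataFrame (hypothetical values)
toy_df = pd.DataFrame({"EIN": ["232705170", "000000000"], "TotalRevenue": [1473903, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_filings.pkl.gz")
    # Same kind of call as the commented-out to_pickle line in the cell above
    toy_df.to_pickle(path, compression="gzip")
    # compression could also be left to the default, which infers gzip from the '.gz' suffix
    reloaded = pd.read_pickle(path, compression="gzip")

print(reloaded.equals(toy_df))
```

Pickle preserves dtypes and nested Python objects (such as the dict-valued columns shown earlier), which is one reason to prefer it over CSV for this intermediate file.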



# Alternative Approaches

In [36]:
#cursor = filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,  'ReturnHeader': 1,
projection_fields = {'_id': 1, 
    'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,  'ReturnHeader': 1,
    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'AccountsPayableAccrExpnssGrp': 1, 
    'AccountsPayableAccruedExpenses': 1, 
    'AccountsReceivable': 1, 
    'AccountsReceivableGrp': 1, 
    'ActivitiesConductedPartnership': 1, 
    'ActivitiesConductedPrtshpInd': 1, 
    'Activity2': 1, 
    'Activity3': 1, 
    'ActivityCd': 1, 
    'ActivityCode': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'ActivityOther': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'Advertising': 1, 
    'AdvertisingGrp': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AllOtherExpenses': 1, 
    'AllOtherExpensesGrp': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidToMembersCY': 1, 
    'BenefitsPaidToMembersPriorYear': 1, 
    'BenefitsToMembers': 1, 
    'BenefitsToMembersGrp': 1, 
    'BuildTS': 1, 
    'BusinessOfficerGrp': 1, 
    'CYBenefitsPaidToMembersAmt': 1, 
    'CYContributionsGrantsAmt': 1, 
    'CYGrantsAndSimilarPaidAmt': 1, 
    'CYInvestmentIncomeAmt': 1, 
    'CYOtherExpensesAmt': 1, 
    'CYOtherRevenueAmt': 1, 
    'CYProgramServiceRevenueAmt': 1, 
    'CYRevenuesLessExpensesAmt': 1, 
    'CYSalariesCompEmpBnftPaidAmt': 1, 
    'CYTotalExpensesAmt': 1, 
    'CYTotalFundraisingExpenseAmt': 1, 
    'CYTotalProfFndrsngExpnsAmt': 1, 
    'CYTotalRevenueAmt': 1, 
    'CashNonInterestBearing': 1, 
    'CashNonInterestBearingGrp': 1, 
    'ChangeToOrgDocumentsInd': 1, 
    'ChangesToOrganizingDocs': 1, 
    'CntrbtnsRprtdFundraisingEvents': 1, 
    'CntrctRcvdGreaterThan100KCnt': 1, 
    'CompCurrentOfcrDirectorsGrp': 1, 
    'CompCurrentOfficersDirectors': 1, 
    'CompDisqualPersons': 1, 
    'CompDisqualPersonsGrp': 1, 
    'CompensationFromOtherSources': 1, 
    'CompensationFromOtherSrcsInd': 1, 
    'CompensationProcessCEO': 1, 
    'CompensationProcessCEOInd': 1, 
    'CompensationProcessOther': 1, 
    'CompensationProcessOtherInd': 1, 
    'ConferencesMeetings': 1, 
    'ConferencesMeetingsGrp': 1, 
    'ConflictOfInterestPolicy': 1, 
    'ConflictOfInterestPolicyInd': 1, 
    'ContractTerminationInd': 1, 
    'ContriRptFundraisingEventAmt': 1, 
    'ContributionsGrantsCurrentYear': 1, 
    'ContributionsGrantsPriorYear': 1, 
    'CostOfGoodsSold': 1, 
    'CostOfGoodsSoldAmt': 1, 
    'CountryLegalDomicile': 1, 
    'DecisionsSubjectToApprovaInd': 1, 
    'DecisionsSubjectToApproval': 1, 
    'DeferredRevenue': 1, 
    'DeferredRevenueGrp': 1, 
    'DelegationOfManagementDuties': 1, 
    'DelegationOfMgmtDutiesInd': 1, 
    'DepreciationDepletion': 1, 
    'DepreciationDepletionGrp': 1, 
    'Desc': 1, 
    'Description': 1, 
    'DisregardedEntity': 1, 
    'DisregardedEntityInd': 1, 
    'DoNotFollowSFAS117': 1, 
    'DocumentRetentionPolicy': 1, 
    'DocumentRetentionPolicyInd': 1, 
    'DonatedServicesAndUseFcltsAmt': 1, 
    'ElectionOfBoardMembers': 1, 
    'ElectionOfBoardMembersInd': 1, 
    'EmployeeCnt': 1, 
    'EngagedInExcessBenefitTransInd': 1, 
    'EscrowAccountLiability': 1, 
    'EscrowAccountLiabilityGrp': 1, 
    'ExcessBenefitTransaction': 1, 
    'Expense': 1, 
    'ExpenseAmt': 1, 
    'FSAudited': 1, 
    'FSAuditedInd': 1, 
    'FamilyOrBusinessRelationship': 1, 
    'FamilyOrBusinessRlnInd': 1, 
    'FederalGrantAuditPerformed': 1, 
    'FederalGrantAuditPerformedInd': 1, 
    'FederalGrantAuditRequired': 1, 
    'FederalGrantAuditRequiredInd': 1, 
    'FederatedCampaigns': 1, 
    'FederatedCampaignsAmt': 1, 
    'FeesForServicesAccounting': 1, 
    'FeesForServicesAccountingGrp': 1, 
    'FeesForServicesInvstMgmntFees': 1, 
    'FeesForServicesLegal': 1, 
    'FeesForServicesLegalGrp': 1, 
    'FeesForServicesLobbying': 1, 
    'FeesForServicesLobbyingGrp': 1, 
    'FeesForServicesManagement': 1, 
    'FeesForServicesManagementGrp': 1, 
    'FeesForServicesOther': 1, 
    'FeesForServicesOtherGrp': 1, 
    'FeesForServicesProfFundraising': 1, 
    'FeesForSrvcInvstMgmntFeesGrp': 1, 
    'Filer': 1, 
    'FinalReturnInd': 1, 
    'FollowSFAS117': 1, 
    'ForeignGrants': 1, 
    'ForeignGrantsGrp': 1, 
    'Form990ProvidedToGoverningBody': 1, 
    'Form990ProvidedToGvrnBodyInd': 1, 
    'FormationYr': 1, 
    'FormerOfcrEmployeesListedInd': 1, 
    'FormersListed': 1, 
    'FundraisingActivities': 1, 
    'FundraisingActivitiesInd': 1, 
    'FundraisingAmt': 1, 
    'FundraisingDirectExpenses': 1, 
    'FundraisingDirectExpensesAmt': 1, 
    'FundraisingEvents': 1, 
    'FundraisingGrossIncomeAmt': 1, 
    'Gaming': 1, 
    'GamingActivitiesInd': 1, 
    'GamingDirectExpenses': 1, 
    'GamingDirectExpensesAmt': 1, 
    'GamingGrossIncomeAmt': 1, 
    'GoverningBodyVotingMembersCnt': 1, 
    'GovernmentGrants': 1, 
    'GovernmentGrantsAmt': 1, 
    'GrantAmt': 1, 
    'Grants': 1, 
    'GrantsAndSimilarAmntsCY': 1, 
    'GrantsAndSimilarAmntsPriorYear': 1, 
    'GrantsPayable': 1, 
    'GrantsPayableGrp': 1, 
    'GrantsToDomesticIndividuals': 1, 
    'GrantsToDomesticIndividualsGrp': 1, 
    'GrantsToDomesticOrgs': 1, 
    'GrantsToDomesticOrgsGrp': 1, 
    'GrossIncomeFundraisingEvents': 1, 
    'GrossIncomeGaming': 1, 
    'GrossReceipts': 1, 
    'GrossReceiptsAmt': 1, 
    'GrossSalesOfInventory': 1, 
    'GrossSalesOfInventoryAmt': 1, 
    'GroupExemptionNum': 1, 
    'GroupExemptionNumber': 1, 
    'GroupReturnForAffiliates': 1, 
    'GroupReturnForAffiliatesInd': 1, 
    'IRPDocumentCnt': 1, 
    'IndependentVotingMemberCnt': 1, 
    'IndivRcvdGreaterThan100KCnt': 1, 
    'InfoInScheduleOPartIII': 1, 
    'InfoInScheduleOPartIIIInd': 1, 
    'InfoInScheduleOPartIX': 1, 
    'InfoInScheduleOPartIXInd': 1, 
    'InfoInScheduleOPartV': 1, 
    'InfoInScheduleOPartVI': 1, 
    'InfoInScheduleOPartVII': 1, 
    'InfoInScheduleOPartVIII': 1, 
    'InfoInScheduleOPartVIIIInd': 1, 
    'InfoInScheduleOPartVIIInd': 1, 
    'InfoInScheduleOPartVIInd': 1, 
    'InfoInScheduleOPartVInd': 1, 
    'InfoInScheduleOPartX': 1, 
    'InfoInScheduleOPartXI': 1, 
    'InfoInScheduleOPartXII': 1, 
    'InfoInScheduleOPartXIIInd': 1, 
    'InfoInScheduleOPartXIInd': 1, 
    'InfoInScheduleOPartXInd': 1, 
    'InformationTechnology': 1, 
    'InformationTechnologyGrp': 1, 
    'InitialReturn': 1, 
    'InitialReturnInd': 1, 
    'Insurance': 1, 
    'InsuranceGrp': 1, 
    'IntangibleAssets': 1, 
    'IntangibleAssetsGrp': 1, 
    'Interest': 1, 
    'InterestGrp': 1, 
    'InventoriesForSaleOrUse': 1, 
    'InventoriesForSaleOrUseGrp': 1, 
    'InvestmentExpenseAmt': 1, 
    'InvestmentInJointVenture': 1, 
    'InvestmentInJointVentureInd': 1, 
    'InvestmentIncomeCurrentYear': 1, 
    'InvestmentIncomePriorYear': 1, 
    'InvestmentsOtherSecurities': 1, 
    'InvestmentsOtherSecuritiesGrp': 1, 
    'InvestmentsProgramRelated': 1, 
    'InvestmentsProgramRelatedGrp': 1, 
    'InvestmentsPubTradedSecGrp': 1, 
    'InvestmentsPubTradedSecurities': 1, 
    'LandBldgEquipAccumDeprecAmt': 1, 
    'LandBldgEquipBasisNetGrp': 1, 
    'LandBldgEquipCostOrOtherBssAmt': 1, 
    'LandBldgEquipmentAccumDeprec': 1, 
    'LandBuildingsEquipmentBasis': 1, 
    'LandBuildingsEquipmentBasisNet': 1, 
    'LegalDomicileCountryCd': 1, 
    'LegalDomicileStateCd': 1, 
    'LoansFromOfficersDirectors': 1, 
    'LoansFromOfficersDirectorsGrp': 1, 
    'LobbyingActivities': 1, 
    'LobbyingActivitiesInd': 1, 
    'LocalChapters': 1, 
    'LocalChaptersInd': 1, 
    'MaterialDiversionOrMisuse': 1, 
    'MaterialDiversionOrMisuseInd': 1, 
    'MembersOrStockholders': 1, 
    'MembersOrStockholdersInd': 1, 
    'MembershipDues': 1, 
    'MembershipDuesAmt': 1, 
    'MethodOfAccountingAccrual': 1, 
    'MethodOfAccountingAccrualInd': 1, 
    'MethodOfAccountingCash': 1, 
    'MethodOfAccountingCashInd': 1, 
    'MethodOfAccountingOther': 1, 
    'MethodOfAccountingOtherInd': 1, 
    'MinutesOfCommittees': 1, 
    'MinutesOfCommitteesInd': 1, 
    'MinutesOfGoverningBody': 1, 
    'MinutesOfGoverningBodyInd': 1, 
    'MissionDesc': 1, 
    'MissionDescription': 1, 
    'MortNotesPyblSecuredInvestProp': 1, 
    'MortgNotesPyblScrdInvstPropGrp': 1, 
    'NameOfPrincipalOfficerPerson': 1, 
    'NbrIndependentVotingMembers': 1, 
    'NbrVotingGoverningBodyMembers': 1, 
    'NbrVotingMembersGoverningBody': 1, 
    'NetAssetsOrFundBalancesBOY': 1, 
    'NetAssetsOrFundBalancesBOYAmt': 1, 
    'NetAssetsOrFundBalancesEOY': 1, 
    'NetAssetsOrFundBalancesEOYAmt': 1, 
    'NetUnrelatedBusTxblIncmAmt': 1, 
    'NetUnrelatedBusinessTxblIncome': 1, 
    'NetUnrlzdGainsLossesInvstAmt': 1, 
    'NoListedPersonsCompensated': 1, 
    'NoListedPersonsCompensatedInd': 1, 
    'NoncashContributions': 1, 
    'NoncashContributionsAmt': 1, 
    'NumberFormsTransmittedWith1096': 1, 
    'NumberIndependentVotingMembers': 1, 
    'NumberIndividualsGT100K': 1, 
    'NumberOfContractorsGT100K': 1, 
    'NumberOfEmployees': 1, 
    'Occupancy': 1, 
    'OccupancyGrp': 1, 
    'OfficeExpenses': 1, 
    'OfficeExpensesGrp': 1, 
    'Officer': 1, 
    'OfficerMailingAddress': 1, 
    'OfficerMailingAddressInd': 1, 
    'OrgDoesNotFollowSFAS117Ind': 1, 
    'Organization4947a1': 1, 
    'Organization4947a1NotPFInd': 1, 
    'Organization501c': 1, 
    'Organization501c3': 1, 
    'Organization501c3Ind': 1, 
    'Organization501cInd': 1, 
    'OrganizationFollowsSFAS117Ind': 1, 
    'OthNotesLoansReceivableNetGrp': 1, 
    'OtherAssetsTotal': 1, 
    'OtherAssetsTotalGrp': 1, 
    'OtherEmployeeBenefits': 1, 
    'OtherEmployeeBenefitsGrp': 1, 
    'OtherExpensePriorYear': 1, 
    'OtherExpenses': 1, 
    'OtherExpensesCurrentYear': 1, 
    'OtherExpensesGrp': 1, 
    'OtherLiabilities': 1, 
    'OtherLiabilitiesGrp': 1, 
    'OtherNotesLoansReceivableNet': 1, 
    'OtherOrganizationDsc': 1, 
    'OtherRevenueCurrentYear': 1, 
    'OtherRevenuePriorYear': 1, 
    'OtherRevenueTotalAmt': 1, 
    'OtherSalariesAndWages': 1, 
    'OtherSalariesAndWagesGrp': 1, 
    'OtherWebsite': 1, 
    'OtherWebsiteInd': 1, 
    'OwnWebsite': 1, 
    'OwnWebsiteInd': 1, 
    'PYBenefitsPaidToMembersAmt': 1, 
    'PYContributionsGrantsAmt': 1, 
    'PYExcessBenefitTransInd': 1, 
    'PYGrantsAndSimilarPaidAmt': 1, 
    'PYInvestmentIncomeAmt': 1, 
    'PYOtherExpensesAmt': 1, 
    'PYOtherRevenueAmt': 1, 
    'PYProgramServiceRevenueAmt': 1, 
    'PYRevenuesLessExpensesAmt': 1, 
    'PYSalariesCompEmpBnftPaidAmt': 1, 
    'PYTotalExpensesAmt': 1, 
    'PYTotalProfFndrsngExpnsAmt': 1, 
    'PYTotalRevenueAmt': 1, 
    'PaymentsToAffiliates': 1, 
    'PaymentsToAffiliatesGrp': 1, 
    'PayrollTaxes': 1, 
    'PayrollTaxesGrp': 1, 
    'PensionPlanContributions': 1, 
    'PensionPlanContributionsGrp': 1, 
    'PermanentlyRestrictedNetAssets': 1, 
    'PermanentlyRstrNetAssetsGrp': 1, 
    'PledgesAndGrantsReceivable': 1, 
    'PledgesAndGrantsReceivableGrp': 1, 
    'PoliciesReferenceChapters': 1, 
    'PoliciesReferenceChaptersInd': 1, 
    'PoliticalActivities': 1, 
    'PoliticalCampaignActyInd': 1, 
    'PrepaidExpensesDeferredCharges': 1, 
    'PrepaidExpensesDefrdChargesGrp': 1, 
    'PrincipalOfficerNm': 1, 
    'PriorExcessBenefitTransaction': 1, 
    'PriorPeriodAdjustmentsAmt': 1, 
    'ProfessionalFundraising': 1, 
    'ProfessionalFundraisingInd': 1, 
    'ProgSrvcAccomActy2Grp': 1, 
    'ProgSrvcAccomActy3Grp': 1, 
    'ProgSrvcAccomActyOtherGrp': 1, 
    'ProgramServiceRevenueCY': 1, 
    'ProgramServiceRevenuePriorYear': 1, 
    'PymtTravelEntrtnmntPubOfclGrp': 1, 
    'RcvblFromDisqualifiedPrsnGrp': 1, 
    'ReceivablesFromDisqualPersons': 1, 
    'ReconcilationDonatedServices': 1, 
    'ReconcilationInvestExpenses': 1, 
    'ReconcilationPriorAdjustment': 1, 
    'ReconcilationRevenueExpenses': 1, 
    'ReconcilationRevenueExpnssAmt': 1, 
    'ReconciliationUnrealizedInvest': 1, 
    'RegularMonitoringEnforcement': 1, 
    'RegularMonitoringEnfrcInd': 1, 
    'RelatedEntity': 1, 
    'RelatedEntityInd': 1, 
    'RelatedOrgControlledEntity': 1, 
    'RelatedOrganizationCtrlEntInd': 1, 
    'RelatedOrganizations': 1, 
    'RelatedOrganizationsAmt': 1, 
    'RetainedEarningsEndowmentEtc': 1, 
    'ReturnTs': 1, 
    'Revenue': 1, 
    'RevenueAmt': 1, 
    'RevenuesLessExpensesCY': 1, 
    'RevenuesLessExpensesPriorYear': 1, 
    'Royalties': 1, 
    'RoyaltiesGrp': 1, 
    'RtnEarnEndowmentIncmOthFndsGrp': 1, 
    'SalariesEtcCurrentYear': 1, 
    'SalariesEtcPriorYear': 1, 
    'SavingsAndTempCashInvestments': 1, 
    'SavingsAndTempCashInvstGrp': 1, 
    'SignificantChange': 1, 
    'SignificantChangeInd': 1, 
    'SignificantNewProgramServices': 1, 
    'SignificantNewProgramSrvcInd': 1, 
    'SpecialConditionDesc': 1, 
    'SpecialConditionDescription': 1, 
    'StateLegalDomicile': 1, 
    'StatesWhereCopyOfReturnIsFiled': 1, 
    'StatesWhereCopyOfReturnIsFldCd': 1, 
    'TaxExemptBondLiabilities': 1, 
    'TaxExemptBondLiabilitiesGrp': 1, 
    'TaxPeriod': 1, 
    'TaxPeriodBeginDate': 1, 
    'TaxPeriodBeginDt': 1, 
    'TaxPeriodEndDate': 1, 
    'TaxPeriodEndDt': 1, 
    'TaxYear': 1, 
    'TaxYr': 1, 
    'TemporarilyRestrictedNetAssets': 1, 
    'TemporarilyRstrNetAssetsGrp': 1, 
    'TerminatedReturn': 1, 
    'TerminationOrContraction': 1, 
    'Timestamp': 1, 
    'TotReportableCompRltdOrgAmt': 1, 
    'TotalAssets': 1, 
    'TotalAssetsBOY': 1, 
    'TotalAssetsBOYAmt': 1, 
    'TotalAssetsEOY': 1, 
    'TotalAssetsEOYAmt': 1, 
    'TotalAssetsGrp': 1, 
    'TotalCompGT150K': 1, 
    'TotalCompGreaterThan150KInd': 1, 
    'TotalContributions': 1, 
    'TotalContributionsAmt': 1, 
    'TotalEmployeeCnt': 1, 
    'TotalExpensesCurrentYear': 1, 
    'TotalExpensesPriorYear': 1, 
    'TotalFunctionalExpenses': 1, 
    'TotalFunctionalExpensesGrp': 1, 
    'TotalFundrsngExpCurrentYear': 1, 
    'TotalGrossUBI': 1, 
    'TotalGrossUBIAmt': 1, 
    'TotalJointCosts': 1, 
    'TotalJointCostsGrp': 1, 
    'TotalLiabilitiesBOY': 1, 
    'TotalLiabilitiesBOYAmt': 1, 
    'TotalLiabilitiesEOY': 1, 
    'TotalLiabilitiesEOYAmt': 1, 
    'TotalNbrEmployees': 1, 
    'TotalNbrVolunteers': 1, 
    'TotalOfOtherProgramServiceExp': 1, 
    'TotalOfOtherProgramServiceGrnt': 1, 
    'TotalOfOtherProgramServiceRev': 1, 
    'TotalOtherCompensation': 1, 
    'TotalOtherCompensationAmt': 1, 
    'TotalOtherProgSrvcExpenseAmt': 1, 
    'TotalOtherProgSrvcGrantAmt': 1, 
    'TotalOtherProgSrvcRevenueAmt': 1, 
    'TotalOtherRevenue': 1, 
    'TotalProfFundrsngExpCY': 1, 
    'TotalProfFundrsngExpPriorYear': 1, 
    'TotalProgramServiceExpense': 1, 
    'TotalProgramServiceExpensesAmt': 1, 
    'TotalProgramServiceRevenue': 1, 
    'TotalProgramServiceRevenueAmt': 1, 
    'TotalReportableCompFrmRltdOrgs': 1, 
    'TotalReportableCompFromOrg': 1, 
    'TotalReportableCompFromOrgAmt': 1, 
    'TotalRevenue': 1, 
    'TotalRevenueCurrentYear': 1, 
    'TotalRevenueGrp': 1, 
    'TotalRevenuePriorYear': 1, 
    'TotalVolunteersCnt': 1, 
    'TransactionRelatedEntity': 1, 
    'TransactionWithControlEntInd': 1, 
    'TransfersToExemptNonChrtblOrg': 1, 
    'Travel': 1, 
    'TravelEntrtnmntPublicOfficials': 1, 
    'TravelGrp': 1, 
    'TrnsfrExmptNonChrtblRltdOrgInd': 1, 
    'TypeOfOrgOtherDescription': 1, 
    'TypeOfOrganizationAssocInd': 1, 
    'TypeOfOrganizationAssociation': 1, 
    'TypeOfOrganizationCorpInd': 1, 
    'TypeOfOrganizationCorporation': 1, 
    'TypeOfOrganizationOther': 1, 
    'TypeOfOrganizationOtherInd': 1, 
    'TypeOfOrganizationTrust': 1, 
    'TypeOfOrganizationTrustInd': 1, 
    'UnrelatedBusIncmOverLimitInd': 1, 
    'UnrelatedBusinessIncome': 1, 
    'UnrestrictedNetAssets': 1, 
    'UnrestrictedNetAssetsGrp': 1, 
    'UnsecuredNotesLoansPayable': 1, 
    'UnsecuredNotesLoansPayableGrp': 1, 
    'UponRequest': 1, 
    'UponRequestInd': 1, 
    'VotingMembersGoverningBodyCnt': 1, 
    'VotingMembersIndependentCnt': 1, 
    'WebSite': 1, 
    'WebsiteAddressTxt': 1, 
    'WhistleblowerPolicy': 1, 
    'WhistleblowerPolicyInd': 1, 
    'WrittenPolicyOrProcedure': 1, 
    'WrittenPolicyOrProcedureInd': 1, 
    'YearFormation': 1}
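
Typing out several hundred `'Name': 1` pairs by hand is error-prone. Assuming the variable names are already available as a Python list (for example, taken from the concordance file), a projection like the one above could be built with a short helper. This is a minimal sketch: `build_projection` and the short `variable_names` list are hypothetical stand-ins, not part of the original notebook.

```python
def build_projection(variable_names,
                     id_fields=("EIN", "OrganizationName", "DLN", "URL", "ReturnHeader"),
                     include_id=True):
    """Return a MongoDB-style inclusion projection of the form {field: 1, ...}."""
    projection = {"_id": 1 if include_id else 0}
    # De-duplicate and sort the variable names so the result is reproducible
    for name in list(id_fields) + sorted(set(variable_names)):
        projection[name] = 1
    return projection

# Hypothetical short list; the real list would hold all verified variable names
variable_names = ["TotalRevenue", "TotalAssets", "TotalRevenue", "AddressChangeInd"]
projection = build_projection(variable_names)
print(projection)
```

Keeping `'_id': 1` matters for the `_id`-based pagination approach further down, which needs the `_id` of the last document on each page.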

### ✅ 1. Standard Batching with pd.concat()

In [None]:
import pandas as pd

def load_batched_to_df(collection, projection, batch_size=10000):
    cursor = collection.find({}, projection)
    batch = []
    dfs = []
    for doc in cursor:
        batch.append(doc)
        if len(batch) >= batch_size:
            dfs.append(pd.DataFrame(batch))
            batch = []
    if batch:
        dfs.append(pd.DataFrame(batch))
    return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()

df = load_batched_to_df(collection, projection=projection_fields)
print(df.shape)
print(len(df))
df[:2]

### ✅ 2. Export via mongoexport to JSON then pandas.read_json()

In [None]:
%%bash
# Shell command: the %%bash cell magic lets it run from inside the notebook
mongoexport --db your_database --collection filings_990 \
  --out filings.json --jsonArray \
  --type=json --fields EIN,OrganizationName,DLN,URL,ReturnHeader,AccountantCompileOrReview

In [None]:
import pandas as pd

df = pd.read_json("filings.json", orient="records")
print(df.shape)
df[:2]
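
Two details are worth checking after this kind of import. `mongoexport` writes MongoDB Extended JSON, so an exported `_id` typically arrives as a nested `{"$oid": "..."}` object rather than a plain string, and `pd.read_json` may coerce numeric-looking strings (such as an EIN with a leading zero) into numbers unless told otherwise. The sketch below illustrates one way to handle both on a small inline sample; the ObjectId values, the first EIN, and the organization names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-document sample in the shape mongoexport --jsonArray produces
sample_export = """[
 {"_id": {"$oid": "64b7f0c2a1b2c3d4e5f60718"}, "EIN": "012345678", "OrganizationName": "EXAMPLE ORG A"},
 {"_id": {"$oid": "64b7f0c2a1b2c3d4e5f60719"}, "EIN": "232705170", "OrganizationName": "EXAMPLE ORG B"}
]"""

# dtype=False asks pandas not to infer dtypes, so the EIN strings stay strings
export_df = pd.read_json(io.StringIO(sample_export), orient="records", dtype=False)

# Unwrap the Extended JSON {"$oid": ...} wrapper into a plain string
export_df["_id"] = export_df["_id"].map(lambda v: v["$oid"] if isinstance(v, dict) else v)

print(export_df.dtypes)
print(export_df)
```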

### ✅ 3. Use Dask for Memory-Efficient Loading

In [None]:
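%%bash
# Omit --jsonArray here: without it, mongoexport writes one JSON document per line,
# which is the layout Dask needs in order to split the file into blocks
mongoexport --db your_database --collection filings_990 \
  --out filings.jsonl \
  --type=json --fields EIN,OrganizationName,DLN,URL,ReturnHeader,AccountantCompileOrReview

In [None]:
import dask.dataframe as dd

# blocksize only works with line-delimited JSON; tune it for performance
ddf = dd.read_json("filings.jsonl", lines=True, blocksize=64 * 2**20)  # 64 MiB blocks
df = ddf.compute()  # Convert to Pandas (this loads the full result into memory)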
print(df.shape)
df[:2]

### ✅ 4. Streaming Load with _id-based Pagination

In [None]:
import pandas as pd

def load_paged_df(collection, projection, batch_size=10000):
    """Page through the collection in _id order; the projection must include '_id'."""
    last_id = None
    dfs = []

    while True:
        # Resume strictly after the last _id seen on the previous page
        query = {'_id': {'$gt': last_id}} if last_id is not None else {}
        docs = list(collection.find(query, projection).sort('_id', 1).limit(batch_size))
        if not docs:
            break
        dfs.append(pd.DataFrame(docs))
        last_id = docs[-1]['_id']

    return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()

df = load_paged_df(collection, projection=projection_fields)
print(df.shape)
df[:2]
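
The paging logic above can be tried without a MongoDB server. The sketch below mimics it on an in-memory list: `DOCS` and `fetch_page` are hypothetical stand-ins for the collection and for `collection.find(query).sort('_id', 1).limit(n)`, and the integer `_id` values are made up (real `_id` values would normally be ObjectIds).

```python
import pandas as pd

# In-memory stand-in for a collection; integer _id values are hypothetical
DOCS = [{"_id": i, "EIN": f"{i:09d}"} for i in range(1, 8)]

def fetch_page(last_id, page_size):
    """Mimic find({'_id': {'$gt': last_id}}).sort('_id', 1).limit(page_size)."""
    ordered = sorted(DOCS, key=lambda d: d["_id"])
    if last_id is not None:
        ordered = [d for d in ordered if d["_id"] > last_id]
    return ordered[:page_size]

def load_paged(page_size=3):
    last_id = None
    frames = []
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            break
        frames.append(pd.DataFrame(page))
        last_id = page[-1]["_id"]  # resume strictly after the last document seen
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

paged_df = load_paged()
print(len(paged_df))
```

Because each page starts strictly after the last `_id` already seen, no document is read twice, and an interrupted run could in principle resume from a stored `last_id`.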

### ✅ Full Example: Batched Load with tqdm + Optional CSV/Parquet Write

In [37]:
#import pandas as pd
#from pymongo import MongoClient
from tqdm import tqdm

## --- Setup ---
#client = MongoClient("mongodb://localhost:27017/")
#db = client["your_database"]
#collection = db["filings_990"]

# Your projection (truncated for example)
#projection_fields = {
#    '_id': 0,
#    'EIN': 1,
#    'OrganizationName': 1,
#    'DLN': 1,
#    'URL': 1
#    # Add your full field list here
#}

import os  # used for os.path.exists below

# --- Batched Loading Function ---
def load_batched_to_df(collection, projection, batch_size=10000, max_docs=None, output_parquet=None, output_csv=None):
    total_docs = collection.count_documents({})
    if max_docs:
        total_docs = min(total_docs, max_docs)

    cursor = collection.find({}, projection).batch_size(batch_size)
    if max_docs:
        cursor = cursor.limit(max_docs)  # actually stop at max_docs, not just shorten the progress bar

    dfs = []

    def _flush(batch):
        """Write one batch to Parquet or CSV, or keep it in memory."""
        df_batch = pd.DataFrame(batch)
        if output_parquet:
            # The pyarrow engine has no append option; fastparquet supports append=True
            # once the file exists (all batches must then share the same columns)
            df_batch.to_parquet(output_parquet, engine="fastparquet", index=False, compression="snappy",
                                append=os.path.exists(output_parquet))
        elif output_csv:
            # Write the header only when the file is first created
            df_batch.to_csv(output_csv, mode='a', header=not os.path.exists(output_csv), index=False)
        else:
            dfs.append(df_batch)

    batch = []
    with tqdm(total=total_docs, desc="Loading documents") as pbar:
        for doc in cursor:
            batch.append(doc)
            if len(batch) >= batch_size:
                _flush(batch)
                pbar.update(len(batch))
                batch = []
        if batch:  # final partial batch
            _flush(batch)
            pbar.update(len(batch))

    if not output_csv and not output_parquet:
        return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
    else:
        return None  # Data written to file

In [42]:
#import pandas as pd
import json  # needed by _clean_nested_objects below
import os
#from pymongo import MongoClient
from tqdm import tqdm
from datetime import datetime

## --- Setup ---
#client = MongoClient("mongodb://localhost:27017/")
#db = client["your_database"]
#collection = db["filings_990"]

# Your projection (truncated for example)
#projection_fields = {
#    '_id': 0,
#    'EIN': 1,
#    'OrganizationName': 1,
#    'DLN': 1,
#    'URL': 1
#    # Add your full field list here
#}


#def _clean_nested_objects(df):
#    """Convert dict/list columns to JSON strings for Parquet compatibility."""
#    for col in df.columns:
#        if df[col].apply(lambda x: isinstance(x, (dict, list))).any():
#            df[col] = df[col].apply(json.dumps)
#    return df


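def _to_json_or_missing(x):
    """JSON-encode a value, but keep None/NaN as missing rather than the string 'NaN'."""
    if x is None or (isinstance(x, float) and x != x):
        return None
    return json.dumps(x)


def _clean_nested_objects(df):
    """Convert dicts/lists to JSON and ObjectId to str for Parquet compatibility."""
    for col in df.columns:
        # Convert ObjectId to string (matched by class name, so bson need not be imported here)
        if df[col].apply(lambda x: type(x).__name__ == "ObjectId").any():
            df[col] = df[col].astype(str)
        # Convert nested dict/list to JSON string
        elif df[col].apply(lambda x: isinstance(x, (dict, list))).any():
            df[col] = df[col].apply(_to_json_or_missing)
    return df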


def load_batched_to_df(
    collection,
    projection,
    batch_size=10000,
    max_docs=None,
    output_dir="filings_batches",  # ✅ new: output folder
    output_base_name="filings_batch",  # used to name each Parquet file
    return_df=True
):
    os.makedirs(output_dir, exist_ok=True)
    metadata_path = os.path.join(output_dir, "metadata.csv")

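    # Load existing metadata to support resume.
    # metadata.csv is appended to below without a header row, so name the columns here;
    # otherwise the "filename" lookup fails and every run starts fresh.
    completed_files = set()
    if os.path.exists(metadata_path):
        try:
            completed_metadata = pd.read_csv(
                metadata_path,
                header=None,
                names=["filename", "n_rows", "saved_at", "last_id"],
            )
            completed_files = set(completed_metadata["filename"])
        except Exception:
            print("⚠️ Could not read existing metadata.csv — starting fresh")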

    total_docs = collection.count_documents({})
    if max_docs:
        total_docs = min(total_docs, max_docs)

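    cursor = collection.find({}, projection).batch_size(batch_size)
    if max_docs:
        # Cap the cursor itself; otherwise max_docs only shortens the progress bar
        cursor = cursor.limit(max_docs)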
    batch = []
    dfs = []
    batch_num = 0

    with tqdm(total=total_docs, desc="Loading documents") as pbar:
        for doc in cursor:
            batch.append(doc)
            if len(batch) >= batch_size:
                df_batch = pd.DataFrame(batch)
                df_batch = _clean_nested_objects(df_batch)
                filename = f"{output_base_name}_{batch_num:05}.parquet"
                full_path = os.path.join(output_dir, filename)

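                if filename in completed_files:
                    # Already on disk from an earlier run: skip the write, but keep
                    # the rows so the returned DataFrame is still complete
                    if return_df:
                        dfs.append(df_batch)
                    batch = []
                    batch_num += 1
                    pbar.update(batch_size)
                    continue  # ✅ skip if already saved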

                df_batch.to_parquet(full_path, engine="pyarrow", index=False, compression="snappy")

                # ✅ write metadata
                with open(metadata_path, "a") as f:
                    # .get() avoids a KeyError when the projection excludes _id
                    f.write(f"{filename},{len(df_batch)},{datetime.now().isoformat()},{doc.get('_id')}\n")

                if return_df:
                    dfs.append(df_batch)

                batch = []
                batch_num += 1
                pbar.update(batch_size)

        # Final batch
        if batch:
            df_batch = pd.DataFrame(batch)
            df_batch = _clean_nested_objects(df_batch)
            filename = f"{output_base_name}_{batch_num:05}.parquet"
            full_path = os.path.join(output_dir, filename)

            if filename not in completed_files:
                df_batch.to_parquet(full_path, engine="pyarrow", index=False, compression="snappy")
                with open(metadata_path, "a") as f:
                    # .get() avoids a KeyError when the projection excludes _id
                    f.write(f"{filename},{len(df_batch)},{datetime.now().isoformat()},{doc.get('_id')}\n")

            if return_df:
                dfs.append(df_batch)

            pbar.update(len(batch))

    if return_df:
        return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
    else:
        return None
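
Each line the loader appends to `metadata.csv` has four comma-separated fields and no header row, so anything that reads the file back has to supply the column names itself. The sketch below does that with the standard library only. The column labels and the demo rows are my own placeholders, not names used elsewhere in this notebook.

```python
import csv
import os
import tempfile

# Hypothetical column labels; the file itself has no header row.
METADATA_COLUMNS = ["filename", "n_rows", "saved_at", "last_id"]

def read_completed_batches(metadata_path):
    """Return the set of batch filenames listed in a headerless metadata.csv."""
    if not os.path.exists(metadata_path):
        return set()
    with open(metadata_path, newline="") as f:
        reader = csv.DictReader(f, fieldnames=METADATA_COLUMNS)
        return {row["filename"] for row in reader if row["filename"]}

# Demo with a throwaway file written the same way the loader writes it
# (timestamps and ids are placeholders)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "metadata.csv")
with open(demo_path, "a") as f:
    f.write("filings_batch_00000.parquet,10000,2025-04-13T21:00:00,aaaaaaaaaaaaaaaaaaaaaaaa\n")
    f.write("filings_batch_00001.parquet,10000,2025-04-13T21:02:00,bbbbbbbbbbbbbbbbbbbbbbbb\n")

completed = read_completed_batches(demo_path)
print(sorted(completed))
```

A missing file simply yields an empty set, which matches the "start fresh" behavior of the loader.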

### ✅ Example Usage
#### In-memory DataFrame:

In [None]:
#%%time
#dfx = load_batched_to_df(collection, projection=projection_fields)
#print(dfx.shape)

#### ✅ Save to Parquet in batches (with resume + metadata) and return full DataFrame

In [None]:
dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",  # ✅ will be created automatically
    output_base_name="filings_batch",
    return_df=True
)

⚠️ Could not read existing metadata.csv — starting fresh


Loading documents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3469008/3469008 [1:51:07<00:00, 520.32it/s]


#### Save DF

In [48]:
dfx

NameError: name 'dfx' is not defined

In [47]:
%%time
dfx.to_parquet("full_filings.parquet", engine="pyarrow", compression="snappy", index=False)

NameError: name 'dfx' is not defined

In [46]:
pwd

'C:\\Users\\Gregory\\IRS 990 Control Variables'

#### 🧠 1. Save to Parquet in Batches + Return Combined DataFrame

In [None]:
#### ✅ Save to Parquet in batches (with resume + metadata) and return full DataFrame

dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",
    output_base_name="filings_batch",
    return_df=True
)

#### 💾 2. Save to Parquet in Batches Only (No DataFrame Returned)

In [None]:
#### 💾 Save to Parquet in batches (for disk use only, no memory load)

load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",
    output_base_name="filings_batch",
    return_df=False
)

#### 🧪 3. Save to CSV Instead (With Appending)

In [None]:
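#### 🧪 Save to CSV incrementally, appending each batch
# NOTE: `output_csv` belongs to the earlier version of load_batched_to_df (the one with
# the output_csv / output_parquet branches). The batch-file version defined in In [42]
# does not accept it and raises a TypeError.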

load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_csv="filings.csv",
    return_df=False
)


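#### 🐍 4. Load Everything into Memory

In [None]:
#### 🐍 Load the entire collection into memory, in batches.
# NOTE: the In [42] version of load_batched_to_df always writes Parquet batches too
# (default output_dir="filings_batches"); only the earlier version skips disk output
# when neither output_csv nor output_parquet is given.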

dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    return_df=True
)

#### 🧪 5. Load a Limited Number of Docs for Testing

In [None]:
#### 🧪 Load only first 50,000 documents for testing (writes batches to disk + memory)

dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="test_batches",
    output_base_name="test_batch",
    max_docs=50000,
    return_df=True
)

#### 📂 6. Resume a Partially Completed Run

In [None]:
#### 🔁 Resume an interrupted run — only writes missing batches

dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",
    output_base_name="filings_batch",
    return_df=True
)
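
Resume works purely by filename: batch *n* always gets the same name, and a batch is skipped when that name is already recorded. The sketch below reproduces the naming scheme and lists what a resumed run would still have to write. The helper names and the 25,000-document scenario are hypothetical. Note that this approach assumes the cursor returns documents in the same order on every run, which an unsorted `find()` does not strictly guarantee.

```python
def batch_filename(base_name, batch_num):
    """Same naming scheme as the loader: batch number zero-padded to five digits."""
    return f"{base_name}_{batch_num:05}.parquet"

def batches_still_to_write(total_docs, batch_size, completed_files, base_name="filings_batch"):
    """List the batch filenames a resumed run would still have to write."""
    n_batches = -(-total_docs // batch_size)  # ceiling division
    all_names = [batch_filename(base_name, n) for n in range(n_batches)]
    return [name for name in all_names if name not in completed_files]

# Hypothetical interrupted run: 25,000 documents, first batch already saved
done = {"filings_batch_00000.parquet"}
todo = batches_still_to_write(25_000, 10_000, done)
print(todo)
```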

#### 📥 Bonus: Reload All Saved Batches Later

In [None]:
#### 📥 Reload all saved batches into one big DataFrame

import glob
files = sorted(glob.glob("filings_batches/filings_batch_*.parquet"))
df_all = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)

# 🧠 `load_batched_to_df()` Usage Cheatsheet

Efficiently load large MongoDB collections into memory or disk with flexible batching, Parquet/CSV export, resume support, and metadata logging.

---

## ✅ Save to Parquet in Batches (with Resume + Metadata) + Return Combined DataFrame

```python
dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",
    output_base_name="filings_batch",
    return_df=True
)
```

- Saves to `filings_batches/filings_batch_*.parquet`
- Tracks progress in `filings_batches/metadata.csv`
- Returns full DataFrame as `dfx`

---

## 💾 Save to Parquet in Batches (Disk Only, No Memory Return)

```python
load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",
    output_base_name="filings_batch",
    return_df=False
)
```

- Saves to disk only
- Minimal RAM usage
- Resumable after crashes

---

## 🧪 Save to CSV in Batches (Appends to One File)

```python
load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_csv="filings.csv",
    return_df=False
)
```

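- Appends each batch to `filings.csv`
- ⚠️ Not resumable without custom logic
- Requires the earlier version of `load_batched_to_df()`; the batch-file version does not accept `output_csv`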

---

## 🐍 Load Entire Dataset to Memory Only (No Disk Save)

```python
dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    return_df=True
)
```

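- With the earlier version of the function, no Parquet or CSV is saved; the batch-file version always writes Parquet batches to `output_dir` (default `filings_batches`)
- Batching keeps memory use steady while the cursor is read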

---

## 🧪 Load a Limited Number of Docs for Testing

```python
dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="test_batches",
    output_base_name="test_batch",
    max_docs=50000,
    return_df=True
)
```

- Use `max_docs` for sampling or testing
- Supports file saving and memory return

---

## 🔁 Resume an Interrupted Run (Auto-Skips Completed Batches)

```python
dfx = load_batched_to_df(
    collection=db["filings_990"],
    projection=projection_fields,
    output_dir="filings_batches",
    output_base_name="filings_batch",
    return_df=True
)
```

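- Reads `metadata.csv` and skips the write for batches already listed there
- Still re-reads every document from MongoDB, so a resumed run saves write time rather than query time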

---

## 📥 Reload All Saved Batches from Disk

```python
import glob
files = sorted(glob.glob("filings_batches/filings_batch_*.parquet"))
df_all = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
```

- Rebuild a full dataset later without re-querying MongoDB

---

🧠 Tip: You can customize `output_dir`, `output_base_name`, and `max_docs` freely.

🚀 The full run shown earlier processed 3,469,008 documents in about 1 hour 51 minutes.

In [8]:
pwd

'C:\\Users\\Gregory\\IRS 990 Control Variables'

In [13]:
import modin.pandas as pd

In [None]:
import modin.pandas as pd
df = pd.read_parquet("filings_batches/filings_batch_00000.parquet")

#### ✅ Step 1: Move the Files to D:\filings_batches

In [9]:
import os
import shutil

source_dir = r"C:\Users\Gregory\IRS 990 Control Variables\filings_batches"  # adjust if needed
dest_dir = r"D:\filings_batches"

# Create destination if it doesn't exist
os.makedirs(dest_dir, exist_ok=True)

# Move each file (Parquet + metadata)
for filename in os.listdir(source_dir):
    if filename.endswith(".parquet") or filename == "metadata.csv":
        shutil.move(os.path.join(source_dir, filename), os.path.join(dest_dir, filename))

print("✅ Files successfully moved to D:\\filings_batches")

✅ Files successfully moved to D:\filings_batches


#### ✅ Step 2: Read All Parquet Files into Modin as df

In [10]:
%%time

import glob
import modin.pandas as pd

# Get all batch files from the new location
files = sorted(glob.glob("D:/filings_batches/filings_batch_*.parquet"))

# Load into Modin DataFrame
df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)

print("✅ Loaded", len(df), "rows into Modin DataFrame `df`")
print(df.shape)
df[:1]

2025-04-13 23:49:28,412	INFO worker.py:1852 -- Started a local Ray instance.


✅ Loaded 3469008 rows into Modin DataFrame `df`
(3469008, 474)
CPU times: total: 6min 51s
Wall time: 16min 4s


Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{""@binaryAttachmentCount"": ""0"", ""Timestamp"": ""2011-11-09T06:41:09-06:00"", ""TaxPeriodEndDate"": ""2010-12-31"", ""PreparerFirm"": {""PreparerFirmBusinessName"": {""BusinessNameLine1"": ""CONCANNON MILLER & CO PC""}, ""PreparerFirmUSAddress"": {""AddressLine1"": ...",X,MICHAEL ANTON,1473903,0,,,,,,,,,,


In [28]:
%%time
df.info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 3469008 entries, 0 to 3469007
Columns: 500 entries, _id to Form8822BAttachedInd
dtypes: object(500)
memory usage: 12.9+ GB
CPU times: total: 47.7 s
Wall time: 1min 8s


#### 🧰 1. ✅ Check RAM Usage Before Saving

In [14]:
import psutil

def check_memory_status(threshold=85):
    mem = psutil.virtual_memory()
    print(f"🧠 RAM Usage: {mem.percent}% ({mem.used / 1e9:.2f} GB / {mem.total / 1e9:.2f} GB)")
    if mem.percent >= threshold:
        print("⚠️  Warning: RAM usage is high. Consider restarting the kernel before saving.")
    else:
        print("✅ Good to go!")

#### ✅ Save df to a Single Parquet File (on D:\)

In [None]:
%%time
df.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)

#### 🧼 2. Optional Cleanup Before Save (Pandas/Modin-Safe)

In [None]:
def prepare_for_save(df):
    import gc

    # Work on an independent copy so the save does not depend on earlier slices or views
    df = df.copy()

    # Optionally sort or reset if needed
    # df = df.sort_values("some_column")  # Only if relevant
    # df = df.reset_index(drop=True)

    # Trigger garbage collection
    gc.collect()

    print("🧼 DataFrame copied + garbage collected. Ready to save.")
    return df

#### ✅ Example Workflow


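##### 🧠 Why This Works

- `df.copy()` returns an independent DataFrame, so the save does not depend on views or slices created earlier in the session. The copy can itself need substantial additional memory.
- `gc.collect()` reclaims unreachable objects, such as dropped intermediate frames caught in reference cycles.
- pandas evaluates eagerly, so a stray `.head()` does not leave work pending. The copy step is a precaution rather than a requirement; in the run below, `df` was saved directly without it.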

In [15]:
# Step 1: Check RAM before saving
check_memory_status()

🧠 RAM Usage: 8.9% (18.14 GB / 204.69 GB)
✅ Good to go!


In [None]:
# Step 2: Clean up df (especially if you’ve been doing .head(), .sort(), etc.)
df_clean = prepare_for_save(df)

In [16]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Step 3: Save
#df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
df.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-14 01:00:10
✅ Save completed: 2025-04-14 01:09:39
CPU times: total: 5min 41s
Wall time: 9min 29s


# 4/13/2025






#### Read in saved file

In [16]:
%%time
import datetime
print("🕓 read started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Read the full Parquet file back in
df = pd.read_parquet("D:/filings_full.parquet", engine="pyarrow")

print("✅ read completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print("✅ Loaded:", df.shape)

🕓 read started: 2025-04-14 22:48:03


2025-04-14 22:48:24,705	INFO worker.py:1852 -- Started a local Ray instance.


✅ read completed: 2025-04-14 22:49:39
✅ Loaded: (3469008, 474)
CPU times: total: 5.64 s
Wall time: 1min 35s


In [17]:
missing_count = df['EIN'].isna().sum()
total_count = len(df)

print(f"🧼 Missing EINs: {missing_count:,} out of {total_count:,} rows")
print(f"📉 Missing Rate: {missing_count / total_count:.2%}")

🧼 Missing EINs: 1,276,573 out of 3,469,008 rows
📉 Missing Rate: 36.80%


In [18]:
df[:2]

Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{""@binaryAttachmentCount"": ""0"", ""Timestamp"": ""2011-11-09T06:41:09-06:00"", ""TaxPeriodEndDate"": ""2010-12-31"", ""PreparerFirm"": {""PreparerFirmBusinessName"": {""BusinessNameLine1"": ""CONCANNON MILLER & CO PC""}, ""PreparerFirmUSAddress"": {""AddressLine1"": ...",X,MICHAEL ANTON,1473903,0,,,,,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,"{""@binaryAttachmentCount"": ""0"", ""Timestamp"": ""2011-11-09T07:32:06-08:00"", ""TaxPeriodEndDate"": ""2011-06-30"", ""PreparerFirm"": {""PreparerFirmBusinessName"": {""BusinessNameLine1"": ""MADDOX & ASSOCIATES APC""}, ""PreparerFirmUSAddress"": {""AddressLine1"": ""...",,,266420,false,,,,,,,,,,


# Drop `EIN`

In [19]:
df = df.drop('EIN', axis=1)
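
The top-level `EIN` is missing for roughly 37% of rows, but the sample rows printed later in this notebook show the EIN nested under `Filer` inside `ReturnHeader`. Below is a minimal sketch of pulling it out of a single header value. It assumes that `Filer` → `EIN` layout holds for other filings too, which is not verified here, and the function name is hypothetical.

```python
import json

def ein_from_return_header(header):
    """Return Filer.EIN from a ReturnHeader given as a JSON string or dict, else None."""
    if isinstance(header, str):
        try:
            header = json.loads(header)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return None
    if not isinstance(header, dict):
        return None
    filer = header.get("Filer")
    return filer.get("EIN") if isinstance(filer, dict) else None

# Trimmed-down header modeled on the first sample row
sample = '{"ReturnType": "990", "Filer": {"EIN": "232705170"}}'
print(ein_from_return_header(sample))
```

Once `ReturnHeader` has been flattened, the same lookup applies to the resulting `Filer` column, whose values are dicts.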


## ✅ 1. Updated Safe-to-Save Workflow (No Kernel Restart)

This drop-in block:

- Checks RAM
- Copies the frame and runs garbage collection
- Returns a frame that is ready to save

```python
import psutil
import gc

def check_memory_status(threshold=85):
    """Print current RAM usage and warn if it's high."""
    mem = psutil.virtual_memory()
    print(f"🧠 RAM Usage: {mem.percent}% ({mem.used / 1e9:.2f} GB / {mem.total / 1e9:.2f} GB)")
    if mem.percent >= threshold:
        print("⚠️  Warning: RAM usage is high. Save may fail. Consider offloading or optimizing.")
    else:
        print("✅ Memory is in a good place for saving.")

def prepare_for_save(df):
    """Safely prepare a large DataFrame for disk write without restarting the kernel."""
    print("🧼 Copying DataFrame to clear cached views...")
    df_clean = df.copy()

    print("🧹 Running garbage collection...")
    gc.collect()

    print("✅ Frame is cleaned and ready to save.")
    return df_clean
```

---

## ✅ 2. How to Use It (Final Save Pattern)

```python
# Step 1: Check memory before you commit to saving
check_memory_status()

# Step 2: Clean any lazy eval or .head() artifacts
df_clean = prepare_for_save(df)

# Step 3: Save!
df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
```

---

## 🧠 Why This Helps *Without* Restarting the Kernel

| Concern                                      | Mitigation                                                       |
|----------------------------------------------|------------------------------------------------------------------|
| Views or slices still referencing the frame  | `df.copy()`                                                      |
| Unreachable intermediate objects             | `gc.collect()`                                                   |
| Deferred work in Modin                       | Copy, then save the fresh object                                 |
| Running out of RAM during the write          | `check_memory_status()` (reports usage; it does not free memory) |

---


# 4/13/2025 - I haven't tested these out below. Probably would need to modify them. The above works well though. 

# ✅ Full Working Code: Resumable MongoDB Loader with Manual Restart

In [None]:
import os
#import pandas as pd
#from pymongo import MongoClient
from bson.objectid import ObjectId
from tqdm import tqdm

## --- Setup MongoDB Connection ---
#client = MongoClient("mongodb://localhost:27017/")
#db = client["your_database"]
#collection = db["filings_990"]

# --- Directory and Checkpoint Paths ---
output_dir = "mongo_csv_batches"
checkpoint_file = "manual_checkpoint.txt"
os.makedirs(output_dir, exist_ok=True)

## --- Projection: include only desired fields ---
#projection_fields = {
#    '_id': 1,  # Required to track progress
#    'EIN': 1,
#    'OrganizationName': 1,
#    'DLN': 1,
#    'URL': 1,
#    'ReturnHeader': 1,
#    # ... add any other fields you need
#}

# --- Function to Get Last Saved _id ---
def get_last_checkpoint():
    if not os.path.exists(checkpoint_file):
        return None
    with open(checkpoint_file, "r") as f:
        lines = f.readlines()
        if not lines:
            return None
        last_line = lines[-1].strip()
        return ObjectId(last_line) if last_line else None

# --- Main Loader Function ---
def manual_batch_loader(batch_size=10000, resume_from_id=None):
    query = {"_id": {"$gt": resume_from_id}} if resume_from_id else {}
    cursor = collection.find(query, projection_fields).sort("_id", 1).batch_size(batch_size)

    batch = []
    batch_num = len([f for f in os.listdir(output_dir) if f.endswith(".csv")])

    for doc in tqdm(cursor, desc="Downloading from MongoDB"):
        batch.append(doc)
        if len(batch) >= batch_size:
            df = pd.DataFrame(batch)
            last_id = batch[-1]["_id"]
            df.drop(columns=["_id"], inplace=True)

            filename = f"{output_dir}/batch_{batch_num:05}.csv"
            df.to_csv(filename, index=False)

            with open(checkpoint_file, "a") as f:
                f.write(str(last_id) + "\n")

            batch = []
            batch_num += 1

    # Final batch (if any)
    if batch:
        df = pd.DataFrame(batch)
        last_id = batch[-1]["_id"]
        df.drop(columns=["_id"], inplace=True)

        filename = f"{output_dir}/batch_{batch_num:05}.csv"
        df.to_csv(filename, index=False)

        with open(checkpoint_file, "a") as f:
            f.write(str(last_id) + "\n")


#### 🧪 How to Use It

##### ✅ First-Time Run

In [None]:
# First time: run from the beginning
manual_batch_loader()

##### ✅ Crash Happens Midway... 😬

Let’s say the script fails. You can later manually restart from where you left off:

In [None]:
# Manually resume from last _id written to checkpoint
resume_from_id = get_last_checkpoint()
print("Resuming from _id:", resume_from_id)

manual_batch_loader(resume_from_id=resume_from_id)
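
The checkpoint file is append-only, with one `_id` per line, and a resume reads only the last line. The round trip can be tried without MongoDB, as in the sketch below. The helper names and ids are placeholders; the real loader wraps the value it reads in `ObjectId(...)`. Unlike `get_last_checkpoint()`, this variant ignores trailing blank lines instead of returning `None` for them.

```python
import os
import tempfile

def append_checkpoint(path, last_id):
    """Append one id per line, as the loader does after each batch."""
    with open(path, "a") as f:
        f.write(str(last_id) + "\n")

def read_last_checkpoint(path):
    """Return the last non-empty line of the checkpoint file, or None."""
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[-1] if lines else None

# Demo with placeholder ids (real ones would be 24-character ObjectId strings)
demo_path = os.path.join(tempfile.mkdtemp(), "manual_checkpoint.txt")
before = read_last_checkpoint(demo_path)
append_checkpoint(demo_path, "id_after_batch_0")
append_checkpoint(demo_path, "id_after_batch_1")
after = read_last_checkpoint(demo_path)
print(before, after)
```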

##### ✅ Optional: Load All Processed CSVs Later

In [None]:
import glob

all_files = sorted(glob.glob("mongo_csv_batches/batch_*.csv"))
df = pd.concat([pd.read_csv(f) for f in all_files], ignore_index=True)
print(df.shape)

# ✅ Full Working Code — Resumable MongoDB → Parquet Batches

In [None]:
import os
import pandas as pd
from pymongo import MongoClient
from bson.objectid import ObjectId
from tqdm import tqdm

# --- Setup MongoDB Connection ---
client = MongoClient("mongodb://localhost:27017/")
db = client["your_database"]
collection = db["filings_990"]

# --- Directory and Checkpoint Paths ---
output_dir = "mongo_parquet_batches"
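# Separate name from the CSV loader's checkpoint so the two loaders never share resume state
checkpoint_file = "manual_checkpoint_parquet.txt"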
os.makedirs(output_dir, exist_ok=True)

# --- Projection Fields ---
projection_fields = {
    '_id': 1,  # Required for tracking
    'EIN': 1,
    'OrganizationName': 1,
    'DLN': 1,
    'URL': 1,
    'ReturnHeader': 1,
    # ... Add your actual fields here
}

# --- Checkpoint Reader ---
def get_last_checkpoint():
    if not os.path.exists(checkpoint_file):
        return None
    with open(checkpoint_file, "r") as f:
        lines = f.readlines()
        if not lines:
            return None
        last_line = lines[-1].strip()
        return ObjectId(last_line) if last_line else None

# --- Parquet Batch Loader ---
def manual_batch_loader(batch_size=10000, resume_from_id=None):
    query = {"_id": {"$gt": resume_from_id}} if resume_from_id else {}
    cursor = collection.find(query, projection_fields).sort("_id", 1).batch_size(batch_size)

    batch = []
    batch_num = len([f for f in os.listdir(output_dir) if f.endswith(".parquet")])

    for doc in tqdm(cursor, desc="Downloading from MongoDB"):
        batch.append(doc)
        if len(batch) >= batch_size:
            df = pd.DataFrame(batch)
            last_id = batch[-1]["_id"]
            df.drop(columns=["_id"], inplace=True)

            filename = f"{output_dir}/batch_{batch_num:05}.parquet"
            df.to_parquet(filename, index=False, engine="pyarrow", compression="snappy")

            with open(checkpoint_file, "a") as f:
                f.write(str(last_id) + "\n")

            batch = []
            batch_num += 1

    # Final partial batch
    if batch:
        df = pd.DataFrame(batch)
        last_id = batch[-1]["_id"]
        df.drop(columns=["_id"], inplace=True)

        filename = f"{output_dir}/batch_{batch_num:05}.parquet"
        df.to_parquet(filename, index=False, engine="pyarrow", compression="snappy")

        with open(checkpoint_file, "a") as f:
            f.write(str(last_id) + "\n")

#### 🧪 How to Use It

- ✅ First-time run and ✅ manual resume later: same as the previous (CSV) example

##### ✅ Load All Parquet Files Later

In [None]:
import glob

all_files = sorted(glob.glob("mongo_parquet_batches/batch_*.parquet"))
df = pd.concat([pd.read_parquet(f) for f in all_files], ignore_index=True)
print(df.shape)

### ✅ 2. How to Install and Use Modin

#### A. 🔧 Installation
Choose the engine (you can use Ray — it’s easiest for local machines)
    
`pip install modin[ray]`

If that doesn't work (e.g., due to ray conflicts), try:

`pip install modin`  
`pip install ray`


#### B. ✅ Replace `import pandas as pd` with `import modin.pandas as pd`

#### ✅ Example: Fast Load with Modin + Save to Parquet

In [None]:
import modin.pandas as pd
from pymongo import MongoClient
from tqdm import tqdm

client = MongoClient("mongodb://localhost:27017/")
collection = client["your_database"]["filings_990"]

projection_fields = {
    '_id': 0,
    'EIN': 1,
    'OrganizationName': 1,
    'DLN': 1,
    'URL': 1,
    'ReturnHeader': 1,
    # ... your full list
}

def fast_load_modin(collection, projection, batch_size=10000):
    cursor = collection.find({}, projection).batch_size(batch_size)
    docs = []
    for doc in tqdm(cursor, desc="Loading with Modin"):
        docs.append(doc)
    return pd.DataFrame(docs)

# Load and save
df = fast_load_modin(collection, projection_fields)
df.to_parquet("modin_output.parquet", engine="pyarrow", compression="snappy")
# Or: df.to_csv("modin_output.csv", index=False)

print(df.shape)

🧠 What You Get

- ✅ Many DataFrame operations run in parallel across CPU cores (the MongoDB cursor loop itself still runs in a single Python process)
- ✅ Same familiar `.to_parquet()` and `.to_csv()`
- ✅ No need to chunk manually, provided the full result fits in RAM (which it does on this machine, at roughly 205 GB)

### Process *ReturnHeader* column
The ``ReturnHeader`` column contains key information about the organization and its 990 filing. In the XML and JSON versions of the filings these data are all 'nested' under *ReturnHeader*, so we need to 'flatten' them such that each variable has its own column. For this task we apply the ``json_normalize`` function in PANDAS. Note that after the Parquet round trip above each *ReturnHeader* value is a JSON string rather than a dict, so it has to be parsed with ``json.loads`` before it can be normalized. The code below then (re-)creates our dataframe ``df`` by joining ``df`` without *ReturnHeader* to the flattened *ReturnHeader* columns. The new ``df`` has the same number of rows but more columns: instead of one *ReturnHeader* column we have multiple new columns, one per top-level key. With ``max_level=0``, anything nested more deeply stays as a single nested value.
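
To make "one-level flattening" concrete, here is a plain-Python illustration of the idea on two hand-made headers. It is not the pandas implementation: the function name and sample values are hypothetical, and missing keys become `None` here where pandas would use `NaN`.

```python
import json

def flatten_column(json_strings):
    """Parse JSON strings and spread their top-level keys into columns.

    Nested objects are left as dicts, mirroring max_level=0.
    """
    records = [json.loads(s) for s in json_strings]
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    rows = [{col: rec.get(col) for col in columns} for rec in records]
    return columns, rows

# Hand-made headers, loosely modeled on the sample filings
headers = [
    '{"ReturnType": "990", "TaxPeriodEndDate": "2010-12-31", "Filer": {"EIN": "232705170"}}',
    '{"ReturnType": "990", "Filer": {"EIN": "581805618"}, "TaxYr": "2011"}',
]
columns, rows = flatten_column(headers)
print(columns)
print(rows[1])
```

The row count is unchanged, every top-level key seen in any header becomes a column, and `Filer` is still a dict that needs its own pass.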

In [20]:
print(len(df.columns))
df.columns

473


Index(['_id', 'OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'ReturnHeader',
       'AddressChange', 'NameOfPrincipalOfficerPerson', 'GrossReceipts',
       'GroupReturnForAffiliates',
       ...
       'InitialReturnInd', 'GamingGrossIncomeAmt', 'GamingDirectExpensesAmt',
       'MethodOfAccountingOtherInd', 'InvestmentExpenseAmt',
       'Organization501cInd', 'Organization4947a1NotPFInd', 'AmendedReturnInd',
       'SpecialConditionDesc', 'ActivityCd'],
      dtype='object', length=473)

In [18]:
#%%time
#df = pd.concat([df.drop(['ReturnHeader'], axis=1), pd.json_normalize(df['ReturnHeader'], max_level=0)], axis=1)
#print(len(df))
#df[:1]

Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.


AttributeError: 'str' object has no attribute 'values'

In [19]:
#%%time
##DIAGNOSE AttributeError: 'str' object has no attribute 'values'
#df['ReturnHeader'].apply(type).value_counts()

the groupby keys will be sorted anyway, although the 'sort=False' was passed. See the following issue for more details: https://github.com/modin-project/modin/issues/3571.


ReturnHeader
<class 'str'>    3469008
Name: count, dtype: int64

# This works but I'll use the function method below instead

In [21]:
%%time
import json
from tqdm import tqdm

# Optional: use tqdm to show progress
tqdm.pandas()

# Step 1: Convert stringified JSON to dicts
#df['ReturnHeader_dict'] = df['ReturnHeader'].progress_apply(json.loads)

# Modin DataFrame → Pandas Series
df['ReturnHeader_dict'] = df['ReturnHeader']._to_pandas().progress_apply(json.loads)

# Step 2: Flatten the nested dicts
flattened = pd.json_normalize(df['ReturnHeader_dict'], max_level=0)

# Step 3: Merge flattened columns back into main df
df = pd.concat([df.drop(['ReturnHeader', 'ReturnHeader_dict'], axis=1), flattened], axis=1)

print("✅ ReturnHeader successfully parsed and flattened.")

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3469008/3469008 [03:12<00:00, 17991.26it/s]


✅ ReturnHeader successfully parsed and flattened.
CPU times: total: 21min 19s
Wall time: 29min 44s


In [125]:
df[:2]

Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{""@binaryAttachmentCount"": ""0"", ""Timestamp"": ""2011-11-09T06:41:09-06:00"", ""TaxPeriodEndDate"": ""2010-12-31"", ""PreparerFirm"": {""PreparerFirmBusinessName"": {""BusinessNameLine1"": ""CONCANNON MILLER & CO PC""}, ""PreparerFirmUSAddress"": {""AddressLine1"": ""1525 VALLEY CENTER PARKWAY SUITE 30"", ""City"": ""BETHLEHEM"", ""State"": ""PA"", ""ZIPCode"": ""180172285""}}, ""ReturnType"": ""990"", ""TaxPeriodBeginDate"": ""2010-01-01"", ""Filer"": {""EIN"": ""232705170"", ""Name"": {""BusinessNameLine1"": ""RONALD MCDONALD HOUSE CHARITIES...",X,MICHAEL ANTON,1473903,0,,,,,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,"{""@binaryAttachmentCount"": ""0"", ""Timestamp"": ""2011-11-09T07:32:06-08:00"", ""TaxPeriodEndDate"": ""2011-06-30"", ""PreparerFirm"": {""PreparerFirmBusinessName"": {""BusinessNameLine1"": ""MADDOX & ASSOCIATES APC""}, ""PreparerFirmUSAddress"": {""AddressLine1"": ""5627 BANKERS AVE BLDG 2"", ""City"": ""BATON ROUGE"", ""State"": ""LA"", ""ZIPCode"": ""708082610""}}, ""ReturnType"": ""990"", ""TaxPeriodBeginDate"": ""2010-07-01"", ""Filer"": {""EIN"": ""581805618"", ""Name"": {""BusinessNameLine1"": ""TORRINGTON VOA ELDERLY HOUSING INC"", ""Busi...",,,266420,false,,,,,,,,,,


In [22]:
df[:2]

Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,EIN,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,TaxPeriodBeginDt,BusinessOfficerGrp,PreparerPersonGrp,TaxYr,DisasterReliefTxt,FilingSecurityInformation,SigningOfficerGrp,AdditionalFilerInformation,IRSResponsiblePrtyInfoCurrInd,Form8822BAttachedInd
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,X,MICHAEL ANTON,1473903,0,X,,,,,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,,,266420,false,X,,,,,,,,,,


🧠 Possible Reasons You're Now Seeing Errors

- The values are JSON strings rather than dicts. In this notebook that is the expected result of `_clean_nested_objects`, which serializes dict/list columns with `json.dumps` before the Parquet batches are written.
- The IRS schema or formatting changed in newer filings, e.g., fields moved deeper into subkeys (nested structure).
- Mixed content: even if 99.9% of values are strings, a few `None` or malformed values will break `json.loads()`.



#### 🧪 Bonus: Detect if IRS format changed
You can do a quick inspection:

In [83]:
sample = df['Filer'].sample(10).tolist()
for i, s in enumerate(sample):
    try:
        print(f"\n[{i}]")
        print(json.dumps(json.loads(s), indent=2))
    except Exception as e:
        print(f"[{i}] ❌ Parse error:", e)




[0]
[0] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[1]
[1] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[2]
[2] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[3]
[3] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[4]
[4] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[5]
[5] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[6]
[6] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[7]
[7] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[8]
[8] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict

[9]
[9] ❌ Parse error: the JSON object must be str, bytes or bytearray, not dict
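
The repeated error above says `json.loads` was handed a `dict`, which suggests the *Filer* values in this sample were already parsed rather than stored as JSON strings. One way to confirm that across a sample is to tally the Python type of each value. The sketch below is only an illustration: `type_census` is a hypothetical helper name and the sample values are made up.

```python
from collections import Counter

def type_census(values):
    """Count how many values of each Python type appear in an iterable."""
    return Counter(type(v).__name__ for v in values)

# Made-up values standing in for something like df['Filer'].sample(10).tolist()
sample = [{"EIN": "123456789"}, '{"EIN": "987654321"}', None, {"EIN": "111111111"}]
census = type_census(sample)
print(census)
```

If the tally shows only `dict` (plus perhaps `NoneType`), no `json.loads` step is needed for that column; a mix of `str` and `dict` is the case the safer parser in the next section is meant to handle.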


### ✅ Safer + Faster Parsing Strategy
Here’s a hardened and reusable version of your current approach:

In [22]:
import json
from pandas import json_normalize
from tqdm import tqdm

def safe_parse_json(value):
    # Parse JSON strings; pass through values that are already parsed (e.g., dicts)
    try:
        return json.loads(value) if isinstance(value, str) else value
    except Exception:
        return None

def parse_json_column(df, column, prefix=None):
    # Assumes `pd` was imported earlier in the notebook and `df` supports `_to_pandas()` (Modin)
    tqdm.pandas()

    # Parse stringified JSON safely
    print(f"🔍 Parsing JSON column: {column}")
    parsed = df[column]._to_pandas().progress_apply(safe_parse_json)

    # Drop rows that failed to parse (optional: log count)
    valid = parsed.notnull()
    if valid.sum() < len(parsed):
        print(f"⚠️ {len(parsed) - valid.sum()} rows could not be parsed and will be dropped.")

    # Rows that parsed successfully, minus the original JSON column
    df_valid = df[pd.Series(valid.values, index=valid.index)].drop(columns=[column])

    # Normalize nested structure (top level only)
    print("🪄 Normalizing...")
    flattened = json_normalize(parsed[valid].tolist(), max_level=0)
    # Reuse the surviving rows' index so concat aligns row-for-row even if rows were dropped
    flattened.index = df_valid.index

    # Add a prefix only when a non-empty one other than "none" is given
    if prefix and prefix != "none":
        flattened.columns = [f"{prefix}_{col}" for col in flattened.columns]

    # Rebuild full DataFrame
    result = pd.concat([df_valid, flattened], axis=1)

    return result

#### ✅ Use It Like This:
You’ll get:

- Auto tqdm progress bar

- Safe JSON parsing

- Clean json_normalize

- Dropped rows that fail (or could be flagged instead)

- Optional prefixing of flattened columns

In [23]:
%%time
df = parse_json_column(df, 'ReturnHeader', prefix='')

🔍 Parsing JSON column: ReturnHeader


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3469008/3469008 [02:19<00:00, 24795.61it/s]


🪄 Normalizing...




CPU times: total: 9min 59s
Wall time: 11min 17s


<br>Flattening *ReturnHeader* adds new columns to ``df``; the blocks below show the column count and the last 18 column names.

In [28]:
# Step 1: Check RAM before saving
check_memory_status()

# Step 2: Clean up df (especially if you’ve been doing .head(), .sort(), etc.)
#df_clean = prepare_for_save(df)

🧠 RAM Usage: 50.4% (103.10 GB / 204.69 GB)
✅ Good to go!


In [29]:
print('# of columns in df:', len(df.columns), '\n')
df.columns[-18:]

# of columns in df: 499 



Index(['TaxYear', 'BuildTS', 'DisasterRelief', '@binaryAttachmentCnt',
       'ReturnTs', 'TaxPeriodEndDt', 'PreparerFirmGrp', 'ReturnTypeCd',
       'TaxPeriodBeginDt', 'BusinessOfficerGrp', 'PreparerPersonGrp', 'TaxYr',
       'DisasterReliefTxt', 'FilingSecurityInformation', 'SigningOfficerGrp',
       'AdditionalFilerInformation', 'IRSResponsiblePrtyInfoCurrInd',
       'Form8822BAttachedInd'],
      dtype='object')

<br>Not all of these contain information that is useful for us, so we will delete some of the unneeded *ReturnHeader* columns

In [30]:
omit_cols = ['@binaryAttachmentCnt', '@binaryAttachmentCount',
             'PreparerFirmGrp', 'PreparerFirm', 
             'ReturnTypeCd',  'ReturnType',  
             'PreparerPersonGrp', 'Preparer', 
             'DisasterReliefTxt', 'DisasterRelief', 
             'FilingSecurityInformation']

In [27]:
df[omit_cols][-2:]

KeyboardInterrupt: 

In [None]:
#pd.set_option('max_colwidth', 500)

#####  Check for missing columns -- these are all from ReturnHeaderGrp
Columns in `df` but not `mongo_cols`

In [32]:
%%time
print([c for c in df.columns.tolist() if c not in mongo_cols])

['_id', 'OrganizationName', 'URL', 'DLN', '@binaryAttachmentCount', 'PreparerFirm', 'ReturnType', 'Preparer', 'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd', 'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation', 'SigningOfficerGrp', 'AdditionalFilerInformation', 'IRSResponsiblePrtyInfoCurrInd', 'Form8822BAttachedInd']
CPU times: total: 0 ns
Wall time: 5.5 ms


In [33]:
set(mongo_cols) - set(df.columns.tolist())

set()

In [34]:
%%time
print('# of columns in df:', len(df.columns), '\n')
df = df[[c for c in df.columns.tolist() if c not in omit_cols]]
print(len(df))
print('# of columns in df:', len(df.columns), '\n')
df[:1]

# of columns in df: 499 

3469008
# of columns in df: 488 

CPU times: total: 78.1 ms
Wall time: 62.3 ms


Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,BuildTS,ReturnTs,TaxPeriodEndDt,TaxPeriodBeginDt,BusinessOfficerGrp,TaxYr,SigningOfficerGrp,AdditionalFilerInformation,IRSResponsiblePrtyInfoCurrInd,Form8822BAttachedInd
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,X,MICHAEL ANTON,1473903,0,X,,2016-02-24 21:20:13Z,,,,,,,,,
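
The list comprehension above keeps every column not named in *omit_cols*, which works even when some of those names are absent from ``df``. As a hedged alternative, pandas' `DataFrame.drop` accepts `errors="ignore"` for the same purpose. The sketch below uses a made-up toy frame in plain pandas; whether the call behaves identically under Modin has not been checked here.

```python
import pandas as pd

# Toy frame; only some of the names in omit_cols exist in it
toy = pd.DataFrame({"DLN": [1], "ReturnType": ["990"], "TaxYr": ["2018"]})
omit_cols = ["ReturnTypeCd", "ReturnType", "FilingSecurityInformation"]

# errors="ignore" skips names that are not present instead of raising KeyError
trimmed = toy.drop(columns=omit_cols, errors="ignore")
print(trimmed.columns.tolist())
```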


#### Save DF

In [35]:
check_memory_status()

🧠 RAM Usage: 42.6% (87.12 GB / 204.69 GB)
✅ Good to go!


In [36]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Step 3: Save
#df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
df.to_parquet("D:/all_filings_april_2025_all_controls.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-14 23:06:33
✅ Save completed: 2025-04-14 23:07:51
CPU times: total: 4.59 s
Wall time: 1min 18s


In [37]:
check_memory_status()

🧠 RAM Usage: 46.9% (96.00 GB / 204.69 GB)
✅ Good to go!


### Create *Fiscal Year* Variable

In [38]:
[c for c in df.columns.tolist() if 'Tax' in c]

['TaxPeriod',
 'PayrollTaxes',
 'TaxExemptBondLiabilities',
 'PayrollTaxesGrp',
 'TaxExemptBondLiabilitiesGrp',
 'TaxPeriodEndDate',
 'TaxPeriodBeginDate',
 'TaxYear',
 'TaxPeriodEndDt',
 'TaxPeriodBeginDt',
 'TaxYr']

<br>We'll run a block of code to count the observations with and without a value for *TaxPeriod*.

##### Note: This variable is not in the new filings

In [34]:
#%%time
#print(len(df[df['TaxPeriod'].notnull()]))
#print(len(df[df['TaxPeriod'].isnull()]))

In [39]:
%%time
print(len(df[df['TaxPeriodEndDt'].notnull()]))
print(len(df[df['TaxPeriodEndDt'].isnull()]))

2973485
495523
CPU times: total: 5.25 s
Wall time: 45.1 s
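
The two counts above each build a filtered copy of the dataframe just to measure its length. A lighter-weight way to get the same numbers is to sum the boolean mask directly. The sketch below shows the idea on a made-up pandas Series; whether it is actually faster on the full Modin dataframe is an assumption that has not been timed here.

```python
import pandas as pd

# Toy stand-in for the real column; values are made up
s = pd.Series(["2018-12-31", None, "2019-06-30", None, None], name="TaxPeriodEndDt")

n_missing = int(s.isna().sum())   # count nulls without materialising a filtered frame
n_present = int(s.notna().sum())
print(n_present, n_missing)
```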


<br>We can show here the top 5 frequencies for this variable.

In [60]:
#df['TaxPeriod'].value_counts().head()

201912    146388
201812    145700
201712    138828
201612    131311
201512    124527
Name: TaxPeriod, dtype: int64

##### Note the different format

In [50]:
df['TaxPeriodEndDt'].value_counts().head()

TaxPeriodEndDt
2023-12-31    202778
2022-12-31    202158
2021-12-31    193368
2020-12-31    181339
2019-12-31    156501
Name: count, dtype: int64

<br>Show the data type for *TaxPeriodEndDt*. It is ``O``, short for 'object' (string variable).

In [41]:
df['TaxPeriodEndDt'].dtype

dtype('O')

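<br>We'll create a new variable, *fiscal_year*, that comprises the first four characters of the *TaxPeriod* value. (This version is commented out and the output shown is from an earlier run; the current run further below uses *TaxPeriodEndDt* instead.)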

In [62]:
#df['fiscal_year'] = df['TaxPeriod'].str[:4]
#df['fiscal_year'].value_counts()

2019    265281
2018    261382
2017    251401
2016    240304
2015    228000
2014    210538
2013    190710
2012    170761
2020    135692
2011    126923
2010     22562
2021       878
2108         1
2001         1
2000         1
Name: fiscal_year, dtype: int64

In [40]:
check_memory_status()

🧠 RAM Usage: 47.6% (97.46 GB / 204.69 GB)
✅ Good to go!


In [42]:
df['fiscal_year'] = df['TaxPeriodEndDt'].str[:4]
df['fiscal_year'].value_counts()

the groupby keys will be sorted anyway, although the 'sort=False' was passed. See the following issue for more details: https://github.com/modin-project/modin/issues/3571.


fiscal_year
2023    346694
2022    345378
2021    332441
2020    308651
2019    276308
2018    261873
2017    251414
2016    240304
2015    228000
2014    210538
2013    103432
2024     68432
2025        16
2001         2
2000         1
2012         1
Name: count, dtype: int64

<br>To get a rough sense of the breakdown of the data by year we will create a new dataset called *years*, rename the first column, sort the dataset and then show the data. You can see here that the isolated observations for 2000, 2001, and 2012 are most likely data entry errors. The rest of the values are as expected: filings with a *TaxPeriodEndDt* value run from 2013 through 2024, with a handful ending in 2025. 

Side note: we will use a different variable in our regressions for tax year. We'll get to that in subsequent notebooks.

In [43]:
%%time
years = pd.DataFrame(df['fiscal_year'].value_counts())
years.index.name = 'year'
years = years.reset_index()
years = years.sort_values('year')
years

CPU times: total: 46.9 ms
Wall time: 436 ms


Unnamed: 0,year,count
14,2000,1
13,2001,2
15,2012,1
10,2013,103432
9,2014,210538
8,2015,228000
7,2016,240304
6,2017,251414
5,2018,261873
4,2019,276308


In [55]:
df[['TaxPeriod', 'TaxPeriodEndDate', 'TaxPeriodBeginDate', 'TaxYear', 'TaxPeriodEndDt', 'TaxPeriodBeginDt', 'TaxYr']].sample(10)

Unnamed: 0,TaxPeriod,TaxPeriodEndDate,TaxPeriodBeginDate,TaxYear,TaxPeriodEndDt,TaxPeriodBeginDt,TaxYr
1718312,201812.0,,,,2018-12-31,2018-01-01,2018
810094,201506.0,,,,2015-06-30,2014-07-01,2014
1438670,201712.0,,,,2017-12-31,2017-01-01,2017
1569320,201805.0,,,,2018-05-31,2017-06-01,2017
2139223,202006.0,,,,2020-06-30,2019-07-01,2019
1749510,201905.0,,,,2019-05-31,2018-06-01,2018
1645486,201812.0,,,,2018-12-31,2018-01-01,2018
1142566,201612.0,,,,2016-12-31,2016-01-01,2016
1117558,201612.0,,,,2016-12-31,2016-01-01,2016
3343174,,,,,2023-12-31,2023-01-01,2023


In [44]:
df['TaxYear'].value_counts().sort_index()

TaxYear
2009     33310
2010    123025
2011    159504
2012    179684
Name: count, dtype: int64

In [45]:
df['TaxYr'].value_counts().sort_index()

TaxYr
2013    198710
2014    218590
2015    233520
2016    243852
2017    254505
2018    265961
2019    283662
2020    320246
2021    336484
2022    346046
2023    268392
2024      3517
Name: count, dtype: int64
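
The *TaxYear* counts above (2009 through 2012) sum to 495,523, the same number of rows that are missing *TaxPeriodEndDt*. That is consistent with older filings carrying their dates under the older names (*TaxPeriodEndDate*, *TaxYear*) and newer filings under the newer ones (*TaxPeriodEndDt*, *TaxYr*). If a single year variable covering both vintages is wanted, one option is to coalesce the two end-date fields before taking the first four characters. This is a minimal sketch on made-up rows, not the variable used in later notebooks; `fiscal_year_all` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows: older filings fill the *Date name, newer filings the *Dt name
toy = pd.DataFrame({
    "TaxPeriodEndDate": ["2011-06-30", None, None],
    "TaxPeriodEndDt":   [None, "2018-12-31", "2023-12-31"],
})

# Prefer the newer field and fall back to the older one where it is missing
end_date = toy["TaxPeriodEndDt"].fillna(toy["TaxPeriodEndDate"])
toy["fiscal_year_all"] = end_date.str[:4]   # hypothetical column name
print(toy["fiscal_year_all"].tolist())
```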

In [46]:
print("Number of columns:", len(df.columns))
print("Number of observations:", len(df))
df[:1]    

Number of columns: 489
Number of observations: 3469008


Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,ReturnTs,TaxPeriodEndDt,TaxPeriodBeginDt,BusinessOfficerGrp,TaxYr,SigningOfficerGrp,AdditionalFilerInformation,IRSResponsiblePrtyInfoCurrInd,Form8822BAttachedInd,fiscal_year
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,X,MICHAEL ANTON,1473903,0,X,,,,,,,,,,,


### Initial Verifications - Check whether ``df`` contains all relevant columns
First, we'll create a Python *list* that contains the ID columns we added to the top line of our ``cursor`` earlier.

In [47]:
id_cols = ['DLN', 'EIN', 'URL', 'OrganizationName']

<br>Here we take advantage of Python's 'set' capabilities to compare the columns in our dataset to the columns we are expecting, which are represented by the *mongo_cols* and *id_cols* lists we have created. 

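The first line below uses the ``len`` function to compute, by subtraction, how many columns in our dataframe are not in *mongo_cols* or *id_cols*. It prints 5, but that shortcut undercounts by one here because *EIN* is in *id_cols* and is not currently a column of ``df``. The second line, a true set difference, shows the 6 columns involved.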

In [48]:
print(len(set(df.columns.tolist())) - len(set(mongo_cols)) - len(set(id_cols)))
set(df.columns.tolist()) - set(mongo_cols) - set(id_cols)

5


{'AdditionalFilerInformation',
 'Form8822BAttachedInd',
 'IRSResponsiblePrtyInfoCurrInd',
 'SigningOfficerGrp',
 '_id',
 'fiscal_year'}
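
Because subtracting lengths only matches the set difference when every expected name is actually present, it can help to report both directions of the comparison at once. The sketch below is one way to do that; `compare_columns` is a hypothetical helper and the column lists are made up for illustration.

```python
def compare_columns(actual, expected):
    """Return (extra, missing): names only in `actual`, names only in `expected`."""
    actual, expected = set(actual), set(expected)
    return sorted(actual - expected), sorted(expected - actual)

# Made-up lists standing in for df.columns.tolist() and mongo_cols + id_cols
actual_cols = ["DLN", "URL", "OrganizationName", "_id", "fiscal_year", "GrossReceipts"]
expected_cols = ["DLN", "EIN", "URL", "OrganizationName", "GrossReceipts"]

extra, missing = compare_columns(actual_cols, expected_cols)
print("extra:", extra)      # in the data but not expected
print("missing:", missing)  # expected but not in the data
```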

<br>Let's see two sample rows for these columns. 

In [49]:
df[['_id', 'AdditionalFilerInformation', 'Form8822BAttachedInd',
 'IRSResponsiblePrtyInfoCurrInd', 'SigningOfficerGrp', 'fiscal_year']][:2]

Unnamed: 0,_id,AdditionalFilerInformation,Form8822BAttachedInd,IRSResponsiblePrtyInfoCurrInd,SigningOfficerGrp,fiscal_year
0,5d019e6778ffca27b42818d7,,,,,
1,5d019e6778ffca27b42818d8,,,,,


In [50]:
df[['_id', 'AdditionalFilerInformation', 'Form8822BAttachedInd',
 'IRSResponsiblePrtyInfoCurrInd', 'SigningOfficerGrp', 'fiscal_year']].sample(5)

Unnamed: 0,_id,AdditionalFilerInformation,Form8822BAttachedInd,IRSResponsiblePrtyInfoCurrInd,SigningOfficerGrp,fiscal_year
2055566,617c728cb1ca7b56cbd942b6,,,,,2020.0
1061563,5d06eea778ffca27b4384b98,,,,,2016.0
1219964,5d07b1dc78ffca27b43ab659,,,,,2016.0
820800,5d05bd3078ffca27b4349f1d,,,,,2015.0
622074,5d04beb878ffca27b43196d4,,,,,


In [68]:
df['IRSResponsiblePrtyInfoCurrInd'].describe()

count     545326
unique         4
top         true
freq      230199
Name: IRSResponsiblePrtyInfoCurrInd, dtype: object

In [69]:
df['Form8822BAttachedInd'].describe()

count     1749
unique       1
top          X
freq      1749
Name: Form8822BAttachedInd, dtype: object
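
The two summaries above hint at mixed encodings for checkbox-style variables: *IRSResponsiblePrtyInfoCurrInd* has four distinct values with `true` the most common, while *Form8822BAttachedInd* only ever holds `X`. The first rows of the data show the same pattern for *GroupReturnForAffiliates* (`0` in one filing, `false` in the next). For any indicator columns that are kept, a small normalizer can make them comparable. The mapping below is an assumption about what the encodings mean, not something verified against the IRS schemas, and `normalize_indicator` is a hypothetical helper name.

```python
def normalize_indicator(value):
    """Map common checkbox encodings to True/False; return None for anything else."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in {"x", "true", "1"}:
        return True
    if text in {"false", "0"}:
        return False
    return None  # unknown encoding: leave for manual review

raw = ["X", "true", "false", "0", "1", None, "maybe"]
cleaned = [normalize_indicator(v) for v in raw]
print(cleaned)
```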

<br>I don't want to use some of these variables so let's drop them from our dataframe.

In [51]:
%%time
# Drop the four unneeded columns in a single call
df = df.drop(columns=['AdditionalFilerInformation', 'SigningOfficerGrp',
                      'Form8822BAttachedInd', 'IRSResponsiblePrtyInfoCurrInd'])

CPU times: total: 78.1 ms
Wall time: 96.9 ms


<br>Check whether any columns in *mongo_cols* are missing from our dataframe. 

### These missing *mongo_col* variables are -- potentially -- the problem variables
Update 4/14/2025 - there are now none. They were missing in last run because I only used new filings

In [52]:
set(mongo_cols) - set(df.columns.tolist())

set()

<br>It might not be a problem though -- because while there is not, for example, *AddressChange*, there is *AddressChangeInd*.

Also, see the bottom of notebook (5) in this series; it looks like we have all the variables. 

In [76]:
df[['AddressChangeInd']].sample(5)

Unnamed: 0,AddressChangeInd
1418109,
1833977,
3268843,
1320741,
1205254,


In [53]:
df['Filer'][:1]

0    {'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...
Name: Filer, dtype: object

In [54]:
set(['EIN', 'OrganizationName', 'DLN', 'URL',  'ReturnHeader']) - set(df.columns.tolist())

{'EIN', 'ReturnHeader'}

In [55]:
set(df.columns.tolist()) - set(mongo_cols)

{'DLN', 'OrganizationName', 'URL', '_id', 'fiscal_year'}

<br>Show number of observations

In [56]:
len(df)

3469008

<br>Show number of columns

In [57]:
len(df.columns)

485

#### Process *Filer* column

In [58]:
%%time 
df[['Filer']].sample(5)

CPU times: total: 234 ms
Wall time: 286 ms


Unnamed: 0,Filer
2672847,"{'EIN': '591368568', 'BusinessName': {'BusinessNameLine1Txt': 'AUBURN WATER SYSTEM INC'}, 'BusinessNameControlTxt': 'AUBU', 'PhoneNum': '8506823413', 'USAddress': {'AddressLine1Txt': '3097 LOCKE LANE', 'CityNm': 'CRESTVIEW', 'StateAbbreviationCd'..."
666824,"{'EIN': '231512747', 'BusinessName': {'BusinessNameLine1': 'HOLY SPIRIT HOSPITAL'}, 'BusinessNameControlTxt': 'HOLY', 'PhoneNum': '7177632100', 'USAddress': {'AddressLine1': '503 NORTH 21ST STREET', 'City': 'CAMP HILL', 'State': 'PA', 'ZIPCode': ..."
534650,"{'EIN': '430907069', 'BusinessName': {'BusinessNameLine1': 'MONETT AREA EXTENDED EMPLOYMENT', 'BusinessNameLine2': 'WORKSHOP INC'}, 'BusinessNameControlTxt': 'MONE', 'PhoneNum': '4172353191', 'USAddress': {'AddressLine1': '204 S CENTRAL', 'City':..."
1383180,"{'EIN': '420934286', 'BusinessName': {'BusinessNameLine1Txt': 'CLARENCE NURSING HOME INC'}, 'BusinessNameControlTxt': 'CLAR', 'PhoneNum': '5634523262', 'USAddress': {'AddressLine1Txt': '402 2ND AVE', 'CityNm': 'CLARENCE', 'StateAbbreviationCd': '..."
2605701,"{'EIN': '351786005', 'BusinessName': {'BusinessNameLine1Txt': 'Rehabilitation Hospital of Indiana Inc'}, 'BusinessNameControlTxt': 'REHA', 'PhoneNum': '3173292000', 'USAddress': {'AddressLine1Txt': '4141 Shore Drive', 'CityNm': 'Indianapolis', 'S..."


In [60]:
check_memory_status()

🧠 RAM Usage: 48.2% (98.69 GB / 204.69 GB)
✅ Good to go!


In [61]:
%%time
#df = parse_json_column(df, 'Filer', prefix='filer')
df = parse_json_column(df, 'Filer', prefix="")  # No prefix added -- NEXT TIME RUN THIS

🔍 Parsing JSON column: Filer


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3469008/3469008 [00:05<00:00, 664295.83it/s]


🪄 Normalizing...




CPU times: total: 2min 34s
Wall time: 2min 52s


#### Save DF

In [62]:
check_memory_status()

🧠 RAM Usage: 51.3% (104.97 GB / 204.69 GB)
✅ Good to go!


In [63]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Step 3: Save
#df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
df.to_parquet("D:/all_filings_april_2025_all_controls.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-14 23:30:18
✅ Save completed: 2025-04-14 23:31:39
CPU times: total: 4.48 s
Wall time: 1min 20s


In [89]:
df[:2]

Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,EIN,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,filer_NameControl,filer_Phone,filer_USAddress,filer_ForeignAddress,filer_InCareOfName,filer_BusinessName,filer_BusinessNameControlTxt,filer_PhoneNum,filer_InCareOfNm,filer_ForeignPhoneNum
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,X,MICHAEL ANTON,1473903,0,X,RONA,8565826843,"{'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '18017'}",,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,,,266420,false,X,TORR,7033415000,"{'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA', 'State': 'VA', 'ZIPCode': '22314'}",,,,,,,


In [94]:
[c for c in df.columns if 'filer_' in c]

['filer_EIN',
 'filer_Name',
 'filer_NameControl',
 'filer_Phone',
 'filer_USAddress',
 'filer_ForeignAddress',
 'filer_InCareOfName',
 'filer_BusinessName',
 'filer_BusinessNameControlTxt',
 'filer_PhoneNum',
 'filer_InCareOfNm',
 'filer_ForeignPhoneNum']

#### ✅ Quick Fix: Remove 'filer_' Prefix from All Matching Columns

In [95]:
df.columns = [col.replace("filer_", "") if col.startswith("filer_") else col for col in df.columns]

In [64]:
[c for c in df.columns if 'filer_' in c]

[]
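
Stripping the prefix from every column can create name collisions: `filer_EIN`, for example, becomes `EIN`, which clashes with any `EIN` column already in the dataframe. One hedged alternative is to strip the prefix only when the bare name is still free. The helper below is a hypothetical sketch operating on a plain list of names, with made-up column names for illustration.

```python
def strip_prefix_safely(columns, prefix="filer_"):
    """Remove `prefix` from column names unless the bare name is already taken."""
    taken = {c for c in columns if not c.startswith(prefix)}
    renamed = []
    for col in columns:
        bare = col[len(prefix):] if col.startswith(prefix) else col
        if col.startswith(prefix) and bare in taken:
            renamed.append(col)   # keep the prefix to avoid a duplicate name
        else:
            renamed.append(bare)
            taken.add(bare)
    return renamed

cols = ["EIN", "URL", "filer_EIN", "filer_Phone", "filer_PhoneNum"]
new_cols = strip_prefix_safely(cols)
print(new_cols)
```

Columns that keep their prefix are then easy to spot and can be compared against their unprefixed counterpart before deciding which one to keep.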

In [65]:
df[:2]

Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,NameControl,Phone,USAddress,ForeignAddress,InCareOfName,BusinessName,BusinessNameControlTxt,PhoneNum,InCareOfNm,ForeignPhoneNum
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,X,MICHAEL ANTON,1473903,0,X,,RONA,8565826843,"{'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '18017'}",,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,,,266420,false,X,,TORR,7033415000,"{'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA', 'State': 'VA', 'ZIPCode': '22314'}",,,,,,,


In [176]:
#%%time
#df = pd.concat([df.drop(['Filer'], axis=1), pd.json_normalize(df['Filer'], max_level=0)], axis=1)
#print(len(df))
#df[:1]

891980
CPU times: total: 40.1 s
Wall time: 41.8 s


Unnamed: 0,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501cInd,WebsiteAddressTxt,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalVolunteersCnt,TotalGrossUBIAmt,PYContributionsGrantsAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,CYInvestmentIncomeAmt,PYOtherRevenueAmt,CYOtherRevenueAmt,PYTotalRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,CYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,PYOtherExpensesAmt,CYOtherExpensesAmt,PYTotalExpensesAmt,CYTotalExpensesAmt,PYRevenuesLessExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesBOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,InfoInScheduleOPartVIInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherIn
d,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,UponRequestInd,TotalReportableCompFromOrgAmt,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,BenefitsToMembersGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,OccupancyGrp,TravelGrp,ConferencesMeetingsGrp,DepreciationDepletionGrp,InsuranceGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsPubTradedSecGrp,TotalAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingAccrualInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,URL,ExpenseAmt,TotalProgramServiceExpensesAmt,NoListedPersonsCompensatedInd,OtherExpensesGrp,SavingsAndTempCashInvstGrp,MethodOfAccountingCashInd,NetUnrelatedBusTxblIncmAmt,PYProgramServiceRevenueAmt,PYGrantsAndSimilarPaidAmt,PYTotalProfFndrsngExpnsAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,AllOtherContributionsAmt,TotalProgramServiceRevenueAmt,FeesForServicesManagementGrp,InformationTechnologyGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,InterestGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,MortgNotesPyblScrdInvstPropGrp,InfoInScheduleOPartXIInd,AuditCommitteeInd,FederalGrantAuditPerformedInd,GroupExemptionNum,TypeOfOrganizationAssocInd,InfoInScheduleOPartIXInd,GrantsToDomesticOrgsGrp,FeesForServicesOtherGrp,PaymentsToAffiliatesGrp,InfoInScheduleOPartIIIInd,GrantAmt,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcExpenseAmt,FundraisingGrossIncomeAmt,FundraisingDirectExpensesAmt,GrantsToDom
esticIndividualsGrp,FeesForServicesLegalGrp,AllAffiliatesIncludedInd,TypeOfOrganizationTrustInd,ForeignGrantsGrp,CompDisqualPersonsGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesLobbyingGrp,FeesForServicesProfFundraising,FeesForSrvcInvstMgmntFeesGrp,AdvertisingGrp,RoyaltiesGrp,PymtTravelEntrtnmntPubOfclGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,NetUnrlzdGainsLossesInvstAmt,FundraisingAmt,ContriRptFundraisingEventAmt,GamingGrossIncomeAmt,GamingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,RevenueAmt,OtherWebsiteInd,InfoInScheduleOPartVIIIInd,TemporarilyRstrNetAssetsGrp,PermanentlyRstrNetAssetsGrp,PriorPeriodAdjustmentsAmt,InfoInScheduleOPartXIIInd,MethodOfAccountingOtherInd,TypeOfOrganizationOtherInd,InfoInScheduleOPartVIIInd,PoliciesReferenceChaptersInd,NoncashContributionsAmt,OwnWebsiteInd,GovernmentGrantsAmt,WrittenPolicyOrProcedureInd,InfoInScheduleOPartVInd,TotalOtherProgSrvcRevenueAmt,Organization501c3Ind,TrnsfrExmptNonChrtblRltdOrgInd,UnsecuredNotesLoansPayableGrp,AddressChangeInd,OtherOrganizationDsc,FederatedCampaignsAmt,RelatedOrganizationsAmt,GrantsPayableGrp,TaxExemptBondLiabilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,DonatedServicesAndUseFcltsAmt,InvestmentExpenseAmt,TotalOtherProgSrvcGrantAmt,InfoInScheduleOPartXInd,AmendedReturnInd,InitialReturnInd,LegalDomicileCountryCd,SpecialConditionDesc,Organization4947a1NotPFInd,FinalReturnInd,ContractTerminationInd,TotalJointCostsGrp,ActivityCd,ReturnTs,TaxPeriodEndDt,TaxPeriodBeginDt,BusinessOfficerGrp,TaxYr,BuildTS,fiscal_year,EIN,BusinessName,BusinessNameControlTxt,PhoneNum,USAddress,InCareOfNm,ForeignAddress,ForeignPhoneNum
0,KAYLA RICHARDS,272756,False,"{'@organization501cTypeTxt': '5', '#text': 'X'}",,X,1916,OH,IMPROVE RURAL STANDARD OF LIVING.,10,10,7,39,0,278582,236036,0,15945,10719,29098,26001,323625,272756,0,239263,181041,20738,23105,0,0,94080,65279,354081,269425,-30456,3331,424800,424855,41546,38270,383254,386585,Improve rural standard of living.,False,False,BENEFITS PAID TO OR FOR MEMBERS - THIS IS PAID MEMBERSHIPS TO OHIO FARM BUREAU AND TO AMERICAN FARM BUREAU TO FUTHER THEIR EFFORTS IN PROGRAMMING AND PROMOTING THE FARMING COMMUNITY.,{'Desc': 'MEMBERSHIP - COSTS OF PROMOTING FARM BUREAU AND ITS MISSION. PROMOTION OF FARM BUREAU PROGRAMS AND EVENTS IN ORDER TO EDUCATE THE FARMER AND CONSUMER IN CURRENT FARMING AND FOOD ISSUES.'},"{'Desc': 'CONFERENCE, CONVENTIONS AND MEETINGS - EDUCATION OF VOLUNTEERS FOR THE PROMOTING AND MARKETING OF FARM ISSUES AND CURRENT EVENTS.'}",False,False,False,False,False,False,False,False,False,False,False,False,0,7,False,X,10,10,True,False,False,False,True,False,False,True,True,False,False,True,True,True,True,True,True,True,True,False,OH,X,2130,False,False,False,236036,236036,26001,"{'TotalRevenueColumnAmt': '272756', 'RelatedOrExemptFuncIncomeAmt': '26001', 'ExclusionAmt': '10719'}","{'TotalAmt': '181041', 'ProgramServicesAmt': '181041'}","{'TotalAmt': '1860', 'ProgramServicesAmt': '1860'}","{'TotalAmt': '21245', 'ProgramServicesAmt': '21245'}","{'TotalAmt': '2352', 'ManagementAndGeneralAmt': '2352'}","{'TotalAmt': '2283', 'ManagementAndGeneralAmt': '2283'}","{'TotalAmt': '3478', 'ProgramServicesAmt': '3242', 'ManagementAndGeneralAmt': '236'}","{'TotalAmt': '331', 'ProgramServicesAmt': '331'}","{'TotalAmt': '7018', 'ProgramServicesAmt': '7018'}","{'TotalAmt': '6770', 'ManagementAndGeneralAmt': '6770'}","{'TotalAmt': '697', 'ProgramServicesAmt': '697'}","{'TotalAmt': '42350', 'ProgramServicesAmt': '36428', 'ManagementAndGeneralAmt': '5922'}","{'TotalAmt': '269425', 'ProgramServicesAmt': '251862', 'ManagementAndGeneralAmt': '17563'}","{'BOYAmt': 
'74474', 'EOYAmt': '72904'}","{'BOYAmt': '8765', 'EOYAmt': '6278'}","{'BOYAmt': '1197', 'EOYAmt': '1197'}","{'BOYAmt': '882', 'EOYAmt': '733'}",165116,55332,"{'BOYAmt': '116555', 'EOYAmt': '109784'}","{'BOYAmt': '222927', 'EOYAmt': '233959'}","{'BOYAmt': '424800', 'EOYAmt': '424855'}","{'BOYAmt': '19368', 'EOYAmt': '11076'}","{'BOYAmt': '22178', 'EOYAmt': '27194'}",X,"{'BOYAmt': '383254', 'EOYAmt': '386585'}",3331,X,False,False,False,https://s3.amazonaws.com/irs-form-990/201812509349300101_public.xml,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2018-09-07T04:44:38-07:00,2018-07-31,2017-08-01,"{'PersonNm': 'KAYLA RICHARDS', 'PersonTitleTxt': 'ORGANIZATION DIRECTOR', 'PhoneNum': '4198338015', 'SignatureDt': '2018-08-29'}",2017,2022-09-23 18:48:47Z,2018,346526754,{'BusinessNameLine1Txt': 'Lucas County Farm Bureau'},LUCA,4198338015,"{'AddressLine1Txt': '109 Portage St', 'CityNm': 'Woodville', 'StateAbbreviationCd': 'OH', 'ZIPCd': '43469'}",,,


In [66]:
set(df.columns.tolist()) - set(mongo_cols)

{'BusinessName',
 'BusinessNameControlTxt',
 'DLN',
 'EIN',
 'ForeignAddress',
 'ForeignPhoneNum',
 'InCareOfName',
 'InCareOfNm',
 'Name',
 'NameControl',
 'OrganizationName',
 'Phone',
 'PhoneNum',
 'URL',
 'USAddress',
 '_id',
 'fiscal_year'}

In [67]:
df.columns

Index(['_id', 'OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'AddressChange',
       'NameOfPrincipalOfficerPerson', 'GrossReceipts',
       'GroupReturnForAffiliates', 'Organization501c3',
       ...
       'NameControl', 'Phone', 'USAddress', 'ForeignAddress', 'InCareOfName',
       'BusinessName', 'BusinessNameControlTxt', 'PhoneNum', 'InCareOfNm',
       'ForeignPhoneNum'],
      dtype='object', length=496)

### Save DF

In [None]:
#%%time
#import datetime
#print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
#df.to_pickle('all NEW filings February 2024 - all control variables.pkl.gz', compression='gzip')

In [71]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Step 3: Save
#df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
df.to_parquet("D:/all_filings_april_2025_all_controls.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-14 23:36:05
✅ Save completed: 2025-04-14 23:37:15
CPU times: total: 3.94 s
Wall time: 1min 10s


In [72]:
missing_count = df['EIN'].isna().sum()
total_count = len(df)

print(f"🧼 Missing EINs: {missing_count:,} out of {total_count:,} rows")
print(f"📉 Missing Rate: {missing_count / total_count:.2%}")

🧼 Missing EINs: 0 out of 3,469,008 rows
📉 Missing Rate: 0.00%


# 4/14/2025 - Ended here
The code below highlights an issue: there are two 'EIN' columns, and the one that was in column position 5 has a lot of missing values, especially after 2019. 
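
Before attempting a save, it can be useful to check the column names for repeats, since the Parquet writer rejects a frame with duplicate names. A minimal sketch of such a check follows; `duplicated_names` is a hypothetical helper and the column list is made up.

```python
from collections import Counter

def duplicated_names(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in dict.fromkeys(columns) if counts[name] > 1]

cols = ["_id", "EIN", "URL", "EIN", "DLN"]
dupes = duplicated_names(cols)
print(dupes)
```

On the real dataframe the input would be something like `df.columns.tolist()`; an empty result means the names are unique.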

In [100]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Step 3: Save
#df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
df.to_parquet("D:/all_filings_april_2025_all_controls.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-14 20:52:26


RayTaskError(ValueError): ray::_deploy_ray_func() (pid=38384, ip=127.0.0.1)
  File "python\\ray\\_raylet.pyx", line 2036, in ray._raylet.execute_task
  File "python\\ray\\_raylet.pyx", line 2076, in ray._raylet.execute_task
  File "python\\ray\\_raylet.pyx", line 4386, in ray._raylet.CoreWorker.store_task_outputs
  File "C:\Users\Gregory\anaconda3\lib\site-packages\modin\core\execution\ray\implementations\pandas_on_ray\partitioning\virtual_partition.py", line 335, in _deploy_ray_func
    result = deployer(axis, f_to_deploy, f_args, f_kwargs, *deploy_args, **kwargs)
  File "C:\Users\Gregory\anaconda3\lib\site-packages\modin\logging\logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Users\Gregory\anaconda3\lib\site-packages\modin\core\dataframe\pandas\partitioning\axis_partition.py", line 462, in deploy_axis_func
    raise err
  File "C:\Users\Gregory\anaconda3\lib\site-packages\modin\core\dataframe\pandas\partitioning\axis_partition.py", line 457, in deploy_axis_func
    result = func(dataframe, *f_args, **f_kwargs)
  File "C:\Users\Gregory\anaconda3\lib\site-packages\modin\core\io\column_stores\parquet_dispatcher.py", line 948, in func
    df.to_parquet(**kwargs)
  File "C:\Users\Gregory\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Gregory\anaconda3\lib\site-packages\pandas\core\frame.py", line 3113, in to_parquet
    return to_parquet(
  File "C:\Users\Gregory\anaconda3\lib\site-packages\pandas\io\parquet.py", line 480, in to_parquet
    impl.write(
  File "C:\Users\Gregory\anaconda3\lib\site-packages\pandas\io\parquet.py", line 190, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow\\table.pxi", line 4751, in pyarrow.lib.Table.from_pandas
  File "C:\Users\Gregory\anaconda3\lib\site-packages\pyarrow\pandas_compat.py", line 595, in dataframe_to_arrays
    convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
  File "C:\Users\Gregory\anaconda3\lib\site-packages\pyarrow\pandas_compat.py", line 375, in _get_columns_to_convert
    raise ValueError(
ValueError: Duplicate column names found: ['_id', 'OrganizationName', 'URL', 'DLN', 'TaxPeriod', 'EIN', 'AddressChange', 'NameOfPrincipalOfficerPerson', 'GrossReceipts', 'GroupReturnForAffiliates', 'Organization501c3', 'WebSite', 'TypeOfOrganizationCorporation', 'YearFormation', 'StateLegalDomicile', 'ActivityOrMissionDescription', 'NbrVotingMembersGoverningBody', 'NbrIndependentVotingMembers', 'TotalNbrEmployees', 'TotalNbrVolunteers', 'TotalGrossUBI', 'NetUnrelatedBusinessTxblIncome', 'ContributionsGrantsPriorYear', 'ContributionsGrantsCurrentYear', 'ProgramServiceRevenuePriorYear', 'ProgramServiceRevenueCY', 'InvestmentIncomePriorYear', 'InvestmentIncomeCurrentYear', 'OtherRevenuePriorYear', 'OtherRevenueCurrentYear', 'TotalRevenuePriorYear', 'TotalRevenueCurrentYear', 'GrantsAndSimilarAmntsPriorYear', 'GrantsAndSimilarAmntsCY', 'BenefitsPaidToMembersPriorYear', 'BenefitsPaidToMembersCY', 'SalariesEtcPriorYear', 'SalariesEtcCurrentYear', 'TotalProfFundrsngExpPriorYear', 'TotalProfFundrsngExpCY', 'TotalFundrsngExpCurrentYear', 'OtherExpensePriorYear', 'OtherExpensesCurrentYear', 'TotalExpensesPriorYear', 'TotalExpensesCurrentYear', 'RevenuesLessExpensesPriorYear', 'RevenuesLessExpensesCY', 'TotalAssetsBOY', 'TotalAssetsEOY', 'TotalLiabilitiesBOY', 'TotalLiabilitiesEOY', 'NetAssetsOrFundBalancesBOY', 'NetAssetsOrFundBalancesEOY', 'InfoInScheduleOPartIII', 'MissionDescription', 'SignificantNewProgramServices', 'SignificantChange', 'Expense', 'Grants', 'Description', 'TotalProgramServiceExpense', 'PoliticalActivities', 'LobbyingActivities', 'ProfessionalFundraising', 'FundraisingActivities', 'Gaming', 'ExcessBenefitTransaction', 'PriorExcessBenefitTransaction', 'DisregardedEntity', 'RelatedEntity', 'RelatedOrgControlledEntity', 'TransactionRelatedEntity', 'TransfersToExemptNonChrtblOrg', 'ActivitiesConductedPartnership', 'NumberFormsTransmittedWith1096', 'NumberOfEmployees', 'UnrelatedBusinessIncome', 'InfoInScheduleOPartVI', 'NbrVotingGoverningBodyMembers', 
'NumberIndependentVotingMembers', 'FamilyOrBusinessRelationship', 'DelegationOfManagementDuties', 'ChangesToOrganizingDocs', 'MaterialDiversionOrMisuse', 'MembersOrStockholders', 'ElectionOfBoardMembers', 'DecisionsSubjectToApproval', 'MinutesOfGoverningBody', 'MinutesOfCommittees', 'OfficerMailingAddress', 'LocalChapters', 'Form990ProvidedToGoverningBody', 'ConflictOfInterestPolicy', 'AnnualDisclosureCoveredPersons', 'RegularMonitoringEnforcement', 'WhistleblowerPolicy', 'DocumentRetentionPolicy', 'CompensationProcessCEO', 'CompensationProcessOther', 'InvestmentInJointVenture', 'StatesWhereCopyOfReturnIsFiled', 'UponRequest', 'NoListedPersonsCompensated', 'TotalReportableCompFromOrg', 'TotalReportableCompFrmRltdOrgs', 'TotalOtherCompensation', 'NumberIndividualsGT100K', 'FormersListed', 'TotalCompGT150K', 'CompensationFromOtherSources', 'NumberOfContractorsGT100K', 'AllOtherContributions', 'TotalContributions', 'TotalOtherRevenue', 'TotalRevenue', 'GrantsToDomesticOrgs', 'GrantsToDomesticIndividuals', 'FeesForServicesLegal', 'FeesForServicesAccounting', 'OfficeExpenses', 'PaymentsToAffiliates', 'DepreciationDepletion', 'OtherExpenses', 'AllOtherExpenses', 'TotalFunctionalExpenses', 'SavingsAndTempCashInvestments', 'AccountsReceivable', 'LandBuildingsEquipmentBasis', 'LandBldgEquipmentAccumDeprec', 'LandBuildingsEquipmentBasisNet', 'InvestmentsOtherSecurities', 'TotalAssets', 'AccountsPayableAccruedExpenses', 'GrantsPayable', 'OtherLiabilities', 'FollowSFAS117', 'UnrestrictedNetAssets', 'InfoInScheduleOPartXI', 'ReconcilationRevenueExpenses', 'InfoInScheduleOPartXII', 'MethodOfAccountingAccrual', 'AccountantCompileOrReview', 'FSAudited', 'AuditCommittee', 'FederalGrantAuditRequired', 'AllAffiliatesIncluded', 'GroupExemptionNumber', 'Revenue', 'PoliciesReferenceChapters', 'WrittenPolicyOrProcedure', 'TotalProgramServiceRevenue', 'ForeignGrants', 'BenefitsToMembers', 'CompCurrentOfficersDirectors', 'CompDisqualPersons', 'OtherSalariesAndWages', 
'PensionPlanContributions', 'OtherEmployeeBenefits', 'PayrollTaxes', 'FeesForServicesManagement', 'FeesForServicesLobbying', 'FeesForServicesProfFundraising', 'FeesForServicesInvstMgmntFees', 'FeesForServicesOther', 'Advertising', 'InformationTechnology', 'Royalties', 'Occupancy', 'Travel', 'TravelEntrtnmntPublicOfficials', 'ConferencesMeetings', 'Interest', 'Insurance', 'CashNonInterestBearing', 'PledgesAndGrantsReceivable', 'ReceivablesFromDisqualPersons', 'OtherNotesLoansReceivableNet', 'InventoriesForSaleOrUse', 'PrepaidExpensesDeferredCharges', 'InvestmentsPubTradedSecurities', 'InvestmentsProgramRelated', 'IntangibleAssets', 'OtherAssetsTotal', 'DeferredRevenue', 'MortNotesPyblSecuredInvestProp', 'FederalGrantAuditPerformed', 'LoansFromOfficersDirectors', 'MethodOfAccountingCash', 'Activity2', 'Activity3', 'InfoInScheduleOPartVII', 'TaxExemptBondLiabilities', 'TemporarilyRestrictedNetAssets', 'OtherWebsite', 'PermanentlyRestrictedNetAssets', 'FundraisingEvents', 'CntrbtnsRprtdFundraisingEvents', 'RelatedOrganizations', 'GrossIncomeFundraisingEvents', 'FundraisingDirectExpenses', 'FederatedCampaigns', 'GovernmentGrants', 'MethodOfAccountingOther', 'GrossSalesOfInventory', 'CostOfGoodsSold', 'DoNotFollowSFAS117', 'RetainedEarningsEndowmentEtc', 'InitialReturn', 'MembershipDues', 'GrossIncomeGaming', 'GamingDirectExpenses', 'NoncashContributions', 'InfoInScheduleOPartV', 'OwnWebsite', 'UnsecuredNotesLoansPayable', 'ActivityOther', 'TotalOfOtherProgramServiceExp', 'TotalOfOtherProgramServiceRev', 'EscrowAccountLiability', 'TotalOfOtherProgramServiceGrnt', 'TypeOfOrganizationOther', 'Organization501c', 'TypeOfOrganizationTrust', 'TypeOfOrganizationAssociation', 'CountryLegalDomicile', 'AmendedReturn', 'TypeOfOrgOtherDescription', 'TotalJointCosts', 'TerminatedReturn', 'TerminationOrContraction', 'ActivityCode', 'SpecialConditionDescription', 'Organization4947a1', 'InfoInScheduleOPartIX', 'ReconciliationUnrealizedInvest', 'ReconcilationPriorAdjustment', 
'ReconcilationDonatedServices', 'ReconcilationInvestExpenses', 'InfoInScheduleOPartVIII', 'InfoInScheduleOPartX', 'PrincipalOfficerNm', 'GrossReceiptsAmt', 'GroupReturnForAffiliatesInd', 'Organization501c3Ind', 'TypeOfOrganizationCorpInd', 'FormationYr', 'LegalDomicileStateCd', 'ActivityOrMissionDesc', 'VotingMembersGoverningBodyCnt', 'VotingMembersIndependentCnt', 'TotalEmployeeCnt', 'TotalGrossUBIAmt', 'CYContributionsGrantsAmt', 'CYProgramServiceRevenueAmt', 'CYInvestmentIncomeAmt', 'CYOtherRevenueAmt', 'CYTotalRevenueAmt', 'CYGrantsAndSimilarPaidAmt', 'CYBenefitsPaidToMembersAmt', 'CYSalariesCompEmpBnftPaidAmt', 'CYTotalProfFndrsngExpnsAmt', 'CYTotalFundraisingExpenseAmt', 'CYOtherExpensesAmt', 'CYTotalExpensesAmt', 'CYRevenuesLessExpensesAmt', 'TotalAssetsBOYAmt', 'TotalAssetsEOYAmt', 'TotalLiabilitiesEOYAmt', 'NetAssetsOrFundBalancesBOYAmt', 'NetAssetsOrFundBalancesEOYAmt', 'InfoInScheduleOPartIIIInd', 'MissionDesc', 'SignificantNewProgramSrvcInd', 'SignificantChangeInd', 'Desc', 'PoliticalCampaignActyInd', 'LobbyingActivitiesInd', 'ProfessionalFundraisingInd', 'FundraisingActivitiesInd', 'GamingActivitiesInd', 'EngagedInExcessBenefitTransInd', 'PYExcessBenefitTransInd', 'DisregardedEntityInd', 'RelatedEntityInd', 'RelatedOrganizationCtrlEntInd', 'TransactionWithControlEntInd', 'TrnsfrExmptNonChrtblRltdOrgInd', 'ActivitiesConductedPrtshpInd', 'IRPDocumentCnt', 'EmployeeCnt', 'UnrelatedBusIncmOverLimitInd', 'GoverningBodyVotingMembersCnt', 'IndependentVotingMemberCnt', 'FamilyOrBusinessRlnInd', 'DelegationOfMgmtDutiesInd', 'ChangeToOrgDocumentsInd', 'MaterialDiversionOrMisuseInd', 'MembersOrStockholdersInd', 'ElectionOfBoardMembersInd', 'DecisionsSubjectToApprovaInd', 'MinutesOfGoverningBodyInd', 'MinutesOfCommitteesInd', 'OfficerMailingAddressInd', 'LocalChaptersInd', 'Form990ProvidedToGvrnBodyInd', 'ConflictOfInterestPolicyInd', 'WhistleblowerPolicyInd', 'DocumentRetentionPolicyInd', 'CompensationProcessCEOInd', 'CompensationProcessOtherInd', 
'InvestmentInJointVentureInd', 'StatesWhereCopyOfReturnIsFldCd', 'NoListedPersonsCompensatedInd', 'FormerOfcrEmployeesListedInd', 'TotalCompGreaterThan150KInd', 'CompensationFromOtherSrcsInd', 'MembershipDuesAmt', 'FundraisingAmt', 'AllOtherContributionsAmt', 'TotalContributionsAmt', 'OtherRevenueTotalAmt', 'TotalRevenueGrp', 'FeesForServicesAccountingGrp', 'OfficeExpensesGrp', 'InformationTechnologyGrp', 'ConferencesMeetingsGrp', 'InsuranceGrp', 'OtherExpensesGrp', 'AllOtherExpensesGrp', 'TotalFunctionalExpensesGrp', 'CashNonInterestBearingGrp', 'TotalAssetsGrp', 'OrgDoesNotFollowSFAS117Ind', 'RtnEarnEndowmentIncmOthFndsGrp', 'ReconcilationRevenueExpnssAmt', 'MethodOfAccountingCashInd', 'AccountantCompileOrReviewInd', 'FSAuditedInd', 'FederalGrantAuditRequiredInd', 'WebsiteAddressTxt', 'TotalVolunteersCnt', 'NetUnrelatedBusTxblIncmAmt', 'PYContributionsGrantsAmt', 'PYProgramServiceRevenueAmt', 'PYInvestmentIncomeAmt', 'PYOtherRevenueAmt', 'PYTotalRevenueAmt', 'PYGrantsAndSimilarPaidAmt', 'PYBenefitsPaidToMembersAmt', 'PYSalariesCompEmpBnftPaidAmt', 'PYTotalProfFndrsngExpnsAmt', 'PYOtherExpensesAmt', 'PYTotalExpensesAmt', 'PYRevenuesLessExpensesAmt', 'TotalLiabilitiesBOYAmt', 'ExpenseAmt', 'GrantAmt', 'RevenueAmt', 'ProgSrvcAccomActy2Grp', 'ProgSrvcAccomActy3Grp', 'ProgSrvcAccomActyOtherGrp', 'TotalOtherProgSrvcGrantAmt', 'TotalProgramServiceExpensesAmt', 'InfoInScheduleOPartVIInd', 'AnnualDisclosureCoveredPrsnInd', 'RegularMonitoringEnfrcInd', 'UponRequestInd', 'TotalReportableCompFromOrgAmt', 'TotReportableCompRltdOrgAmt', 'TotalOtherCompensationAmt', 'IndivRcvdGreaterThan100KCnt', 'CntrctRcvdGreaterThan100KCnt', 'GovernmentGrantsAmt', 'TotalProgramServiceRevenueAmt', 'FundraisingGrossIncomeAmt', 'ContriRptFundraisingEventAmt', 'FundraisingDirectExpensesAmt', 'GrossSalesOfInventoryAmt', 'CostOfGoodsSoldAmt', 'GrantsToDomesticIndividualsGrp', 'CompCurrentOfcrDirectorsGrp', 'OtherSalariesAndWagesGrp', 'PensionPlanContributionsGrp', 'OtherEmployeeBenefitsGrp', 
'PayrollTaxesGrp', 'FeesForServicesOtherGrp', 'AdvertisingGrp', 'TravelGrp', 'InterestGrp', 'DepreciationDepletionGrp', 'SavingsAndTempCashInvstGrp', 'AccountsReceivableGrp', 'InventoriesForSaleOrUseGrp', 'PrepaidExpensesDefrdChargesGrp', 'LandBldgEquipCostOrOtherBssAmt', 'LandBldgEquipAccumDeprecAmt', 'LandBldgEquipBasisNetGrp', 'InvestmentsOtherSecuritiesGrp', 'IntangibleAssetsGrp', 'AccountsPayableAccrExpnssGrp', 'DeferredRevenueGrp', 'MortgNotesPyblScrdInvstPropGrp', 'OtherLiabilitiesGrp', 'OrganizationFollowsSFAS117Ind', 'UnrestrictedNetAssetsGrp', 'TemporarilyRstrNetAssetsGrp', 'InfoInScheduleOPartXIInd', 'NetUnrlzdGainsLossesInvstAmt', 'InfoInScheduleOPartXIIInd', 'AuditCommitteeInd', 'AllAffiliatesIncludedInd', 'GrantsToDomesticOrgsGrp', 'ForeignGrantsGrp', 'BenefitsToMembersGrp', 'CompDisqualPersonsGrp', 'FeesForServicesManagementGrp', 'FeesForServicesLegalGrp', 'FeesForServicesLobbyingGrp', 'FeesForSrvcInvstMgmntFeesGrp', 'RoyaltiesGrp', 'OccupancyGrp', 'PymtTravelEntrtnmntPubOfclGrp', 'PaymentsToAffiliatesGrp', 'PledgesAndGrantsReceivableGrp', 'RcvblFromDisqualifiedPrsnGrp', 'OthNotesLoansReceivableNetGrp', 'InvestmentsPubTradedSecGrp', 'InvestmentsProgramRelatedGrp', 'OtherAssetsTotalGrp', 'TotalOtherProgSrvcExpenseAmt', 'InfoInScheduleOPartVInd', 'MethodOfAccountingAccrualInd', 'NoncashContributionsAmt', 'GrantsPayableGrp', 'PermanentlyRstrNetAssetsGrp', 'TaxExemptBondLiabilitiesGrp', 'EscrowAccountLiabilityGrp', 'LoansFromOfficersDirectorsGrp', 'UnsecuredNotesLoansPayableGrp', 'PriorPeriodAdjustmentsAmt', 'FederalGrantAuditPerformedInd', 'PoliciesReferenceChaptersInd', 'OtherWebsiteInd', 'AddressChangeInd', 'WrittenPolicyOrProcedureInd', 'RelatedOrganizationsAmt', 'TotalOtherProgSrvcRevenueAmt', 'OwnWebsiteInd', 'TotalJointCostsGrp', 'DonatedServicesAndUseFcltsAmt', 'LegalDomicileCountryCd', 'InfoInScheduleOPartIXInd', 'TypeOfOrganizationTrustInd', 'FinalReturnInd', 'ContractTerminationInd', 'InfoInScheduleOPartXInd', 'GroupExemptionNum', 
'InfoInScheduleOPartVIIInd', 'FederatedCampaignsAmt', 'TypeOfOrganizationOtherInd', 'OtherOrganizationDsc', 'InfoInScheduleOPartVIIIInd', 'TypeOfOrganizationAssocInd', 'InitialReturnInd', 'GamingGrossIncomeAmt', 'GamingDirectExpensesAmt', 'MethodOfAccountingOtherInd', 'InvestmentExpenseAmt', 'Organization501cInd', 'Organization4947a1NotPFInd', 'AmendedReturnInd', 'SpecialConditionDesc', 'ActivityCd', 'Timestamp', 'TaxPeriodEndDate', 'TaxPeriodBeginDate', 'Officer', 'TaxYear', 'BuildTS', 'ReturnTs', 'TaxPeriodEndDt', 'TaxPeriodBeginDt', 'BusinessOfficerGrp', 'TaxYr', 'fiscal_year', 'EIN', 'Name', 'NameControl', 'Phone', 'USAddress', 'ForeignAddress', 'InCareOfName', 'BusinessName', 'BusinessNameControlTxt', 'PhoneNum', 'InCareOfNm', 'ForeignPhoneNum']

In [68]:
dupes = df.columns[df.columns.duplicated()].tolist()
print("🚨 Duplicate columns:", dupes)

🚨 Duplicate columns: []


In [102]:
# Find the positions of all columns named 'EIN'
ein_indices = [i for i, col in enumerate(df.columns) if col == "EIN"]
print("EIN column indices:", ein_indices)

EIN column indices: [5, 485]


In [104]:
df.iloc[:, ein_indices[0]].to_frame().info()
df.iloc[:, ein_indices[1]].to_frame().info()

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 3469008 entries, 0 to 3469007
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   EIN     object
dtypes: object(1)
memory usage: 26.5+ MB
<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 3469008 entries, 0 to 3469007
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   EIN     object
dtypes: object(1)
memory usage: 26.5+ MB


In [107]:
df.iloc[:, ein_indices].head()

Unnamed: 0,EIN,EIN.1
0,232705170,232705170
1,581805618,581805618
2,581876019,581876019
3,391083432,391083432
4,205297040,205297040


In [109]:
df.iloc[:, ein_indices].info(verbose=True)

<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 3469008 entries, 0 to 3469007
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   EIN     object
 1   EIN     object
dtypes: object(2)
memory usage: 52.9+ MB


In [110]:
ein1 = df.iloc[:, ein_indices[0]]
ein2 = df.iloc[:, ein_indices[1]]

# Check how many missing in each
print("🧼 Missing values:")
print("EIN (index 5):", ein1.isna().sum())
print("EIN (index 485):", ein2.isna().sum())

# Check if they are identical (excluding NaNs)
print("\n🔍 Mismatched values:")
mismatches = (ein1 != ein2) & ~(ein1.isna() & ein2.isna())
print("Number of mismatches:", mismatches.sum())


🧼 Missing values:
EIN (index 5): 1276573
EIN (index 485): 0

🔍 Mismatched values:
Number of mismatches: 1276573
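
The mismatch count equals the number of missing values in the column at index 5, which suggests that every mismatch is a missing value and not a conflicting EIN. A minimal sketch of a direct check follows, written in plain pandas with a hypothetical helper `compare_duplicate_columns` (not part of the original notebook). It restricts the comparison to rows where both columns are populated:

```python
import pandas as pd

def compare_duplicate_columns(a: pd.Series, b: pd.Series) -> dict:
    """Summarize how two same-named columns differ (hypothetical helper)."""
    both = a.notna() & b.notna()  # rows where both values are present
    return {
        "missing_a": int(a.isna().sum()),
        "missing_b": int(b.isna().sum()),
        # True conflicts: both present, but the values differ
        "conflicts": int((a[both] != b[both]).sum()),
    }

# Toy data standing in for the two EIN columns
ein_a = pd.Series(["232705170", None, "581876019", None], dtype="object")
ein_b = pd.Series(["232705170", "581805618", "581876019", "391083432"], dtype="object")
result = compare_duplicate_columns(ein_a, ein_b)
print(result)
```

With the notebook's frame the call would be `compare_duplicate_columns(df.iloc[:, 5], df.iloc[:, 485])`. A `conflicts` value of 0 would confirm that the two columns agree wherever both are filled in.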


In [112]:
# Pull out fiscal year and both EINs
fiscal_year = df["fiscal_year"]
ein_bad = df.iloc[:, 5]
ein_good = df.iloc[:, 485]

# Combine into a temporary DataFrame
tmp = pd.DataFrame({
    "fiscal_year": fiscal_year,
    "ein_bad_notna": ein_bad.notna(),
    "ein_good_notna": ein_good.notna()
})

# Count how many EINs are not missing by year
summary = tmp.groupby("fiscal_year").agg(
    bad_EIN_nonmissing=("ein_bad_notna", "sum"),
    good_EIN_nonmissing=("ein_good_notna", "sum"),
    total_rows=("fiscal_year", "count")
)

# Optional: fill rates
summary["bad_EIN_fill_rate"] = summary["bad_EIN_nonmissing"] / summary["total_rows"]
summary["good_EIN_fill_rate"] = summary["good_EIN_nonmissing"] / summary["total_rows"]

# Show result
summary

fiscal_year,bad_EIN_nonmissing,good_EIN_nonmissing,total_rows,bad_EIN_fill_rate,good_EIN_fill_rate
2000,1,1,1,1.0,1.0
2001,1,2,2,0.5,1.0
2012,1,1,1,1.0,1.0
2013,103432,103432,103432,1.0,1.0
2014,210538,210538,210538,1.0,1.0
2015,228000,228000,228000,1.0,1.0
2016,240304,240304,240304,1.0,1.0
2017,251401,251414,251414,0.999948,1.0
2018,261383,261873,261873,0.998129,1.0
2019,265281,276308,276308,0.960092,1.0


#### ✅ Recommendation
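Keep: the column at index 485 (the good one); it has no missing values.

Drop: the column at index 5 (bad_EIN). It is essentially complete through 2016 and starts losing values in 2017 (about 4% missing in 2019), with most of its 1,276,573 missing values in later years.

If needed, rename EIN to EIN_final to be explicit in future pipelines.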
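
Because both columns share the name `EIN`, a name-based `drop(columns="EIN")` removes both of them. A minimal pandas sketch of dropping by position instead is shown here; `drop_column_at` is a hypothetical helper name, and the toy frame only stands in for `df`:

```python
import pandas as pd

def drop_column_at(df: pd.DataFrame, position: int) -> pd.DataFrame:
    """Return df without the column at the given integer position."""
    keep = [i for i in range(df.shape[1]) if i != position]
    return df.iloc[:, keep]

# Toy frame with two columns named 'EIN' (the first one is incomplete)
toy = pd.DataFrame(
    [[None, "A", "232705170"], ["581805618", "B", "581805618"]],
    columns=["EIN", "Name", "EIN"],
)
cleaned = drop_column_at(toy, 0)
print(cleaned.columns.tolist())  # ['Name', 'EIN']
```

For the notebook's frame the equivalent call would be `drop_column_at(df, 5)`, after which `df['EIN']` refers unambiguously to the complete column.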

#### Drop the 'bad' EIN

In [117]:
df.columns[484]

'Name'

In [121]:
[c for c in df.columns if 'ein' in c.lower()]

['SignificantChangeInd',
 'MaterialDiversionOrMisuseInd',
 'InvestmentInJointVentureInd',
 'AuditCommitteeInd',
 'OtherWebsiteInd',
 'AddressChangeInd',
 'WrittenPolicyOrProcedureInd',
 'OwnWebsiteInd']

In [118]:
df[:2]

Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,NameControl,Phone,USAddress,ForeignAddress,InCareOfName,BusinessName,BusinessNameControlTxt,PhoneNum,InCareOfNm,ForeignPhoneNum
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,X,MICHAEL ANTON,1473903,0,X,,RONA,8565826843,"{'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '18017'}",,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,,,266420,false,X,,TORR,7033415000,"{'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA', 'State': 'VA', 'ZIPCode': '22314'}",,,,,,,


In [122]:
dfx[:2]

NameError: name 'dfx' is not defined

In [113]:
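# Drop by position: a name-based drop of 'EIN' removes BOTH same-named columns
#df = df.iloc[:, [i for i in range(df.shape[1]) if i != 5]]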

In [70]:
missing_count = df['EIN'].isna().sum()
total_count = len(df)

print(f"🧼 Missing EINs: {missing_count:,} out of {total_count:,} rows")
print(f"📉 Missing Rate: {missing_count / total_count:.2%}")

🧼 Missing EINs: 0 out of 3,469,008 rows
📉 Missing Rate: 0.00%


In [69]:
print("✅ Loaded:", df.shape)

✅ Loaded: (3469008, 496)


# Verifications

In [74]:
non_str_cols = [col for col in df.columns if not isinstance(col, str)]
print("🔍 Non-string column names:", non_str_cols)

🔍 Non-string column names: []


In [76]:
from collections import Counter

dupes = [item for item, count in Counter(df.columns).items() if count > 1]
print("🚨 Duplicate columns:", dupes)


🚨 Duplicate columns: []


In [77]:
import pandas as pd

def scan_object_columns(df, sample_size=1000):
    object_cols = df.select_dtypes(include=["object"]).columns
    suspect_cols = {}

    for col in object_cols:
        sample = df[col].dropna().sample(n=min(sample_size, len(df[col].dropna())), random_state=42)
        bad_values = sample[~sample.map(lambda x: isinstance(x, (str, int, float, bool, pd.Timestamp)))]
        if not bad_values.empty:
            suspect_cols[col] = bad_values.iloc[:5].tolist()  # Show 5 example values

    return suspect_cols

suspect = scan_object_columns(df)
print("🚨 Columns with non-stringifiable objects:")
for col, examples in suspect.items():
    print(f"• {col}: {examples}")



KeyboardInterrupt



In [79]:
from collections import defaultdict

def quick_object_type_check(df, n=100):
    bad_columns = defaultdict(set)
    object_cols = df.select_dtypes(include=["object"]).columns

    for col in object_cols:
        sample = df[col].dropna().head(n)
        for val in sample:
            bad_columns[col].add(type(val).__name__)
        # Only show if there are unexpected types
        if bad_columns[col] <= {"str", "int", "float", "bool"}:
            del bad_columns[col]

    return dict(bad_columns)

bad_obj_types = quick_object_type_check(df)
print("🚨 Object columns with unusual types:")
for col, types in bad_obj_types.items():
    print(f"• {col}: {types}")

🚨 Object columns with unusual types:
• Officer: {'dict'}
• BusinessOfficerGrp: {'dict'}
• Name: {'dict'}
• USAddress: {'dict'}
• ForeignAddress: {'dict'}
• BusinessName: {'dict'}


In [80]:
%%time
dict_cols = ['Officer', 'BusinessOfficerGrp', 'Name', 'USAddress', 'ForeignAddress', 'BusinessName']

for col in dict_cols:
    df[col] = df[col].astype(str)

CPU times: total: 219 ms
Wall time: 269 ms
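
Note that `astype(str)` stores the Python `repr` of each dict (single-quoted, so not valid JSON) and, at least in pandas 2.x, turns missing values into literal strings such as `'nan'`. If the nested fields will be needed later, one alternative is to serialize with `json.dumps` and keep missing values missing. The following is a sketch under that assumption, in plain pandas with a hypothetical helper `dicts_to_json`:

```python
import json
import pandas as pd

def dicts_to_json(series: pd.Series) -> pd.Series:
    """Serialize dict cells to JSON text; leave all other cells unchanged."""
    converted = series.map(
        lambda v: json.dumps(v, sort_keys=True) if isinstance(v, dict) else v
    )
    return converted.astype("string")  # nullable string dtype keeps <NA>

# Toy column standing in for USAddress
addr = pd.Series([{"City": "BETHLEHEM", "State": "PA"}, None], dtype="object")
as_json = dicts_to_json(addr)
print(as_json[0])               # JSON text for the first row
print(json.loads(as_json[0]))   # parses back into a dict
print(as_json.isna().tolist())  # missing row stays missing
```

The resulting column is a plain string column, so it should be as Parquet-friendly as the `astype(str)` version while remaining parseable.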


In [81]:
# Re-sample and recheck
sample_df = df.head(20)._to_pandas()
print("🚨 Still has dicts:", [col for col in dict_cols if sample_df[col].map(type).eq(dict).any()])

🚨 Still has dicts: []


In [82]:
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

df.to_parquet("D:/cleaned_all_filings.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save complete:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-15 01:16:37
✅ Save complete: 2025-04-15 01:18:17


In [84]:
df[df['OwnWebsite'].notnull()][['OwnWebsite']][:5]

Unnamed: 0,OwnWebsite
27,X
126,X
129,X
134,X
158,X


In [85]:
checkbox_cols = ['AddressChange',
 'AddressChangeInd',
 'AmendedReturn',
 'AmendedReturnInd',
 'FinalReturnInd',
 'TerminatedReturn',
 'InitialReturn',
 'InitialReturnInd',
 'Organization4947a1',
 'Organization4947a1NotPFInd',
 'Organization501c',
 'Organization501cInd',
 'Organization501c3',
 'Organization501c3Ind',
 'TypeOfOrganizationAssociation',
 'TypeOfOrganizationAssocInd',
 'TypeOfOrganizationCorpInd',
 'TypeOfOrganizationCorporation',
 'TypeOfOrganizationOther',
 'TypeOfOrganizationOtherInd',
 'TypeOfOrganizationTrust',
 'TypeOfOrganizationTrustInd',
 'ContractTerminationInd',
 'TerminationOrContraction',
 'InfoInScheduleOPartIII',
 'InfoInScheduleOPartIIIInd',
 'OwnWebsite',
 'OwnWebsiteInd',
 'OtherWebsite',
 'OtherWebsiteInd',
 'UponRequest',
 'UponRequestInd',
 'NoListedPersonsCompensated',
 'NoListedPersonsCompensatedInd',
 'InfoInScheduleOPartIX',
 'InfoInScheduleOPartIXInd',
 'InfoInScheduleOPartX',
 'InfoInScheduleOPartXInd',
 'FollowSFAS117',
 'OrganizationFollowsSFAS117Ind',
 'DoNotFollowSFAS117',
 'OrgDoesNotFollowSFAS117Ind',
 'MethodOfAccountingAccrual',
 'MethodOfAccountingAccrualInd',
 'MethodOfAccountingCash',
 'MethodOfAccountingCashInd',
 'MethodOfAccountingOther',
 'MethodOfAccountingOtherInd',
 'InfoInScheduleOPartV',
 'InfoInScheduleOPartVInd',
 'InfoInScheduleOPartVI',
 'InfoInScheduleOPartVIInd',
 'InfoInScheduleOPartVII',
 'InfoInScheduleOPartVIIInd',
 'InfoInScheduleOPartVIII',
 'InfoInScheduleOPartVIIIInd',
 'InfoInScheduleOPartXI',
 'InfoInScheduleOPartXIInd',
 'InfoInScheduleOPartXII',
 'InfoInScheduleOPartXIIInd']

In [86]:
for col in checkbox_cols:
    if col in df.columns:
        df[col] = df[col].astype("string")
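
The checkbox columns mix encodings across schema versions: the two rows displayed earlier contain `X`, `0`, and `false` in fields of this kind. Casting to `string` makes the columns safe to write but leaves that mix in place. Below is a sketch of one way to normalize them into a nullable boolean. The token mapping is an assumption about which values occur and should be checked against `value_counts()` first; `normalize_checkbox` and `CHECKBOX_MAP` are hypothetical names:

```python
import pandas as pd

# Assumed token mapping; unknown tokens become <NA> rather than being guessed
CHECKBOX_MAP = {"x": True, "true": True, "1": True, "false": False, "0": False}

def normalize_checkbox(series: pd.Series) -> pd.Series:
    """Map checkbox-style tokens to a nullable boolean column."""
    def to_bool(value):
        if pd.isna(value):
            return pd.NA
        return CHECKBOX_MAP.get(str(value).strip().lower(), pd.NA)
    return series.map(to_bool).astype("boolean")

raw = pd.Series(["X", "false", "0", None, "true", "maybe"], dtype="object")
normalized = normalize_checkbox(raw)
print(normalized.tolist())
```

Sending unrecognized tokens to `<NA>` is a deliberate choice here: it surfaces unexpected values as missing so they can be counted and reviewed, where a default of `False` would hide them.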

In [88]:
def find_columns_with_x_and_number(df):
    suspicious_cols = []
    for col in df.select_dtypes(include=["object"]).columns:
        unique_vals = set(df[col].dropna().unique())
        if 'X' in unique_vals:
            # If there's a number-like value alongside 'X'
            if any(isinstance(val, (int, float)) for val in unique_vals if val != 'X'):
                suspicious_cols.append(col)
    return suspicious_cols

bad_mixed_cols = find_columns_with_x_and_number(df)
print("🚨 Columns with 'X' and numbers mixed:", bad_mixed_cols)

🚨 Columns with 'X' and numbers mixed: []


In [None]:
for col in bad_mixed_cols:
    df[col] = df[col].astype("string")

# Ended here 4/14/2025

In [92]:
import pandas as pd
import pyarrow.parquet as pq

def find_bad_columns(parquet_path, column_list, depth=0):
    indent = "  " * depth
    try:
        pd.read_parquet(parquet_path, columns=column_list, engine="pyarrow")
        print(f"{indent}✅ OK: {len(column_list)} columns")
        return []
    except Exception as e:
        if len(column_list) == 1:
            print(f"{indent}❌ BAD COLUMN: {column_list[0]}")
            return [column_list[0]]
        else:
            mid = len(column_list) // 2
            print(f"{indent}🔍 Splitting {len(column_list)} columns")
            bad_left = find_bad_columns(parquet_path, column_list[:mid], depth+1)
            bad_right = find_bad_columns(parquet_path, column_list[mid:], depth+1)
            return bad_left + bad_right

# Step 1: Get all column names from the metadata (even if we can't read the file fully)
schema = pq.read_schema("D:/all_filings_april_2025_all_controls.parquet")
all_columns = schema.names

# Step 2: Find bad ones
bad_columns = find_bad_columns("D:/all_filings_april_2025_all_controls.parquet", all_columns)

print("\n🚨 Bad columns:")
for col in bad_columns:
    print("•", col)


PermissionError: [WinError 5] Failed to open local file 'D:/all_filings_april_2025_all_controls.parquet'. Detail: [Windows error 5] Access is denied.


In [None]:
# Identify columns containing 'X' and ensure they're saved as string type
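import pandas as pd
import modin.pandas as mpd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq  # `import pyarrow` alone does not load the parquet submodule used below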

# Assuming your DataFrame is already loaded as 'df'

def identify_columns_with_x(df):
    """
    Identify columns that contain 'X' values which should be kept as strings
    """
    columns_with_x = []
    
    for col in df.columns:
        # Only check object/string columns or columns that might be mixed
        if df[col].dtype == 'object' or df[col].dtype == 'string':
            # Use a memory-efficient approach with chunking
            has_x = False
            for chunk_idx, chunk in enumerate(np.array_split(df[col], 100)):
                # Convert to string and check for 'X'
                if chunk.astype(str).str.contains('X').any():
                    has_x = True
                    print(f"Column '{col}' contains 'X' values (found in chunk {chunk_idx})")
                    # Get a sample of the values for inspection
                    sample = chunk[chunk.astype(str).str.contains('X')].head(3)
                    print(f"Sample values: {sample.tolist()}")
                    break
            
            if has_x:
                columns_with_x.append(col)
    
    print(f"Found {len(columns_with_x)} columns with 'X' values")
    return columns_with_x

# Find columns containing 'X'
columns_with_x = identify_columns_with_x(df)

def ensure_string_columns(df, string_columns):
    """
    Ensure that specified columns are saved as string type
    """
    # Create a copy to avoid modifying the original
    df_copy = df.copy()
    
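    # Convert the specified columns to the nullable string dtype;
    # unlike astype(str), this keeps missing values as <NA> instead of the text 'nan'
    for col in string_columns:
        df_copy[col] = df_copy[col].astype("string")
        print(f"Converted '{col}' to string type")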
    
    return df_copy

# Create version with explicit string types
df_fixed = ensure_string_columns(df, columns_with_x)

# Option 1: Save with explicit PyArrow schema to enforce string types
def save_with_explicit_schema(df, columns_as_string, output_path):
    """
    Save DataFrame to parquet with explicit schema that forces certain columns to be string
    """
    # Create schema where specified columns are explicitly string type
    fields = []
    for col in df.columns:
        if col in columns_as_string:
            fields.append(pa.field(col, pa.string()))
        else:
            # For other columns, infer type from data
            # This is a simplified approach - you might need more specific type mapping
            if pd.api.types.is_integer_dtype(df[col]):
                fields.append(pa.field(col, pa.int64()))
            elif pd.api.types.is_float_dtype(df[col]):
                fields.append(pa.field(col, pa.float64()))
            elif pd.api.types.is_bool_dtype(df[col]):
                fields.append(pa.field(col, pa.bool_()))
            elif pd.api.types.is_datetime64_dtype(df[col]):
                fields.append(pa.field(col, pa.timestamp('ns')))
            else:
                # Default to string for any other type
                fields.append(pa.field(col, pa.string()))
    
    schema = pa.schema(fields)
    print(f"Created explicit schema with {len(columns_as_string)} forced string columns")
    
    # Convert to pandas for pyarrow compatibility if using modin
    if 'modin.pandas' in str(type(df)):
        print("Converting from Modin to pandas for PyArrow compatibility")
        df = df._to_pandas()
    
    # Save with explicit schema
    table = pa.Table.from_pandas(df, schema=schema)
    pa.parquet.write_table(table, output_path)
    print(f"Saved parquet file with explicit schema to {output_path}")

# Option 2: Simpler approach without explicit schema
def save_strings_as_strings(df, output_path):
    """
    Save DataFrame to parquet, relying on the column dtypes already set on df
    """
    # Convert to pandas if using modin
    if 'modin.pandas' in str(type(df)):
        print("Converting from Modin to pandas for saving")
        df = df._to_pandas()
    
    # use_dictionary only toggles Parquet dictionary encoding (a storage detail);
    # it does not affect column types, which come from the earlier string conversion
    df.to_parquet(
        output_path,
        engine='pyarrow',
        use_dictionary=False
    )
    print(f"Saved parquet file to {output_path}")

# Example usage:
# 1. Using explicit schema (more control but more complex)
# save_with_explicit_schema(df_fixed, columns_with_x, 'fixed_explicit_schema.parquet')

# 2. Using simpler approach
# save_strings_as_strings(df_fixed, 'fixed_as_strings.parquet')

# TESTING: Verify the file can be read back correctly
def test_parquet_file(file_path):
    """
    Test that the parquet file can be read without errors and check column types
    """
    try:
        test_df = pd.read_parquet(file_path)
        print(f"Successfully read {file_path}")
        
        # Check types of columns that should be strings
        for col in columns_with_x:
            # Check existence first so a missing column does not abort the whole test
            if col not in test_df.columns:
                print(f"WARNING: Column '{col}' not found in read DataFrame")
                continue
            print(f"Column '{col}' type: {test_df[col].dtype}")
            # Verify 'X' is preserved
            if test_df[col].astype(str).str.contains('X').any():
                print(f"'X' values preserved in column '{col}'")
            else:
                print(f"WARNING: No 'X' values found in column '{col}' after reading")
    except Exception as e:
        print(f"Error reading {file_path}: {e}")

# Example test:
# test_parquet_file('fixed_explicit_schema.parquet')

#### Shorter version

In [None]:
# Simple approach to find columns containing 'X'
import pandas as pd
import modin.pandas as mpd

# Assuming your DataFrame is already loaded as 'df'

def find_columns_with_x(df, sample_size=10000):
    """
    Find columns that contain 'X' by taking a sample of rows.
    
    Parameters:
    -----------
    df : DataFrame
        The DataFrame to check
    sample_size : int
        Number of rows to sample (default 10000)
    
    Returns:
    --------
    list
        Columns that contain 'X' values
    """
    # Take a sample to reduce memory usage
    sample_df = df.sample(min(sample_size, len(df)))
    
    # Convert the sample to strings for checking
    columns_with_x = []
    
    for col in sample_df.columns:
        # Check if column has any 'X' values
        if (sample_df[col].astype(str) == 'X').any():
            print(f"Column '{col}' contains 'X' values")
            columns_with_x.append(col)
        elif sample_df[col].astype(str).str.contains('X').any():
            print(f"Column '{col}' contains strings with 'X' in them")
            columns_with_x.append(col)
    
    print(f"\nFound {len(columns_with_x)} columns with 'X' values")
    return columns_with_x

# Find columns with 'X'
x_columns = find_columns_with_x(df)
print("\nList of columns with 'X':")
for col in x_columns:
    print(f"- {col}")

In [94]:
x_columns = ['OrganizationName',
 'AddressChange',
 'NameOfPrincipalOfficerPerson',
 'Organization501c3',
 'WebSite',
 'TypeOfOrganizationCorporation',
 'StateLegalDomicile',
 'ActivityOrMissionDescription',
 'InfoInScheduleOPartIII',
 'MissionDescription',
 'Description',
 'InfoInScheduleOPartVI',
 'StatesWhereCopyOfReturnIsFiled',
 'UponRequest',
 'NoListedPersonsCompensated',
 'OtherExpenses',
 'FollowSFAS117',
 'InfoInScheduleOPartXI',
 'InfoInScheduleOPartXII',
 'MethodOfAccountingAccrual',
 'MethodOfAccountingCash',
 'Activity2',
 'Activity3',
 'InfoInScheduleOPartVII',
 'OtherWebsite',
 'MethodOfAccountingOther',
 'DoNotFollowSFAS117',
 'InitialReturn',
 'InfoInScheduleOPartV',
 'OwnWebsite',
 'ActivityOther',
 'TypeOfOrganizationOther',
 'Organization501c',
 'TypeOfOrganizationTrust',
 'TypeOfOrganizationAssociation',
 'AmendedReturn',
 'TerminatedReturn',
 'TerminationOrContraction',
 'Organization4947a1',
 'InfoInScheduleOPartIX',
 'InfoInScheduleOPartVIII',
 'InfoInScheduleOPartX',
 'PrincipalOfficerNm',
 'Organization501c3Ind',
 'TypeOfOrganizationCorpInd',
 'LegalDomicileStateCd',
 'ActivityOrMissionDesc',
 'InfoInScheduleOPartIIIInd',
 'MissionDesc',
 'Desc',
 'ProfessionalFundraisingInd',
 'FundraisingActivitiesInd',
 'GamingActivitiesInd',
 'PYExcessBenefitTransInd',
 'StatesWhereCopyOfReturnIsFldCd',
 'NoListedPersonsCompensatedInd',
 'OtherExpensesGrp',
 'OrgDoesNotFollowSFAS117Ind',
 'MethodOfAccountingCashInd',
 'WebsiteAddressTxt',
 'ProgSrvcAccomActy2Grp',
 'ProgSrvcAccomActy3Grp',
 'ProgSrvcAccomActyOtherGrp',
 'InfoInScheduleOPartVIInd',
 'UponRequestInd',
 'OrganizationFollowsSFAS117Ind',
 'InfoInScheduleOPartXIInd',
 'InfoInScheduleOPartXIIInd',
 'InfoInScheduleOPartVInd',
 'MethodOfAccountingAccrualInd',
 'OtherWebsiteInd',
 'AddressChangeInd',
 'OwnWebsiteInd',
 'InfoInScheduleOPartIXInd',
 'TypeOfOrganizationTrustInd',
 'FinalReturnInd',
 'ContractTerminationInd',
 'InfoInScheduleOPartXInd',
 'InfoInScheduleOPartVIIInd',
 'TypeOfOrganizationOtherInd',
 'OtherOrganizationDsc',
 'InfoInScheduleOPartVIIIInd',
 'TypeOfOrganizationAssocInd',
 'InitialReturnInd',
 'MethodOfAccountingOtherInd',
 'Organization501cInd',
 'Organization4947a1NotPFInd',
 'AmendedReturnInd',
 'SpecialConditionDesc',
 'Officer',
 'BusinessOfficerGrp',
 'Name',
 'NameControl',
 'USAddress',
 'ForeignAddress',
 'InCareOfName',
 'BusinessName',
 'BusinessNameControlTxt',
 'InCareOfNm']
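The list above mixes two kinds of columns: checkbox-style fields whose value is a literal `X` mark (such as `AddressChange`) and free-text fields that merely contain the letter X somewhere (such as `OrganizationName`). A minimal sketch of how the two might be separated is below. The helper name `is_checkbox_column` and the sample values are hypothetical illustrations, not part of the notebook's pipeline; on the real data one would pass something like `df[col].dropna().unique()`.

```python
def is_missing(value):
    """Treat None, NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float not equal to itself
        return True
    return str(value).strip() == ""


def is_checkbox_column(values, markers=frozenset({"X"})):
    """Return True when every non-missing value is one of the checkbox markers."""
    seen = {str(v).strip() for v in values if not is_missing(v)}
    return bool(seen) and seen <= markers


# Hypothetical sample values, loosely modeled on the rows displayed in this notebook
samples = {
    "AddressChange": ["X", None, "X", ""],
    "OrganizationName": ["TORRINGTON VOA ELDERLY HOUSING INC", "XYZ FOUNDATION"],
    "Organization501c3": ["X", "X", None],
}

checkbox_cols = [col for col, vals in samples.items() if is_checkbox_column(vals)]
text_cols = [col for col in samples if col not in checkbox_cols]
print("Checkbox-style:", checkbox_cols)
print("Free text:", text_cols)
```

The `markers` argument is left configurable because some fields in this dataset appear to use other flag values (for example `true`/`false` or `0`/`1`) instead of `X`.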

In [99]:
df[x_columns].dtypes[:50]

OrganizationName                          object
AddressChange                     string[python]
NameOfPrincipalOfficerPerson              object
Organization501c3                 string[python]
WebSite                                   object
TypeOfOrganizationCorporation     string[python]
StateLegalDomicile                        object
ActivityOrMissionDescription              object
InfoInScheduleOPartIII            string[python]
MissionDescription                        object
Description                               object
InfoInScheduleOPartVI             string[python]
StatesWhereCopyOfReturnIsFiled            object
UponRequest                       string[python]
NoListedPersonsCompensated        string[python]
OtherExpenses                             object
FollowSFAS117                     string[python]
InfoInScheduleOPartXI             string[python]
InfoInScheduleOPartXII            string[python]
MethodOfAccountingAccrual         string[python]
MethodOfAccountingCa

In [None]:
df_fixed = df.copy()
    
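# Convert columns with 'X' to string type
# Caution: in pandas 2.x, astype(str) also turns missing values into literal
# strings such as 'nan' or '<NA>', so later isna() checks will not flag them
for col in x_columns:
    df_fixed[col] = df_fixed[col].astype(str)
    print(f"Converted '{col}' to string type")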
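One possible alternative to `astype(str)` is the nullable `"string"` dtype, which keeps missing values missing instead of stringifying them. The sketch below uses a small made-up Series rather than the real DataFrame, and the behavior of `astype(str)` noted in the comment is what pandas 2.x does; other versions may differ.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for one checkbox column
s = pd.Series(["X", None, np.nan, "X"], dtype="object")

# Plain str conversion; in pandas 2.x the missing entries become ordinary strings
print(s.astype(str).tolist())

# Nullable string dtype: missing entries stay missing
as_string = s.astype("string")
n_missing = int(as_string.isna().sum())
n_marked = int((as_string == "X").sum())
print("missing:", n_missing, "marked X:", n_marked)
```

Whether this is preferable depends on how the columns are used downstream: the nullable dtype keeps `isna()` meaningful, while plain `str` objects may be simpler for code that expects ordinary Python strings.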

In [None]:
# To ensure these columns are saved as strings, simply convert them:
def fix_and_save_parquet(df, string_columns, output_path='fixed_dataframe.parquet'):
    """
    Convert specified columns to string type and save to parquet
    """
    # Make a copy of the dataframe
    df_fixed = df.copy()
    
    # Convert columns with 'X' to string type
    for col in string_columns:
        df_fixed[col] = df_fixed[col].astype(str)
        print(f"Converted '{col}' to string type")
    
    # If using modin, convert to pandas before saving
    if 'modin.pandas' in str(type(df_fixed)):
        df_fixed = df_fixed._to_pandas()
    
    # Save; the columns are written as strings because they were cast above
    df_fixed.to_parquet(
        output_path,
        engine='pyarrow',
        # Disables dictionary encoding only; it does not affect column types
        use_dictionary=False
    )
    print(f"Saved to {output_path}")
    
    return df_fixed

# Example usage:
# fix_and_save_parquet(df, x_columns)

In [97]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Note: use_dictionary and write_statistics only change how the data is encoded on disk;
# neither one affects how PyArrow infers column types
df.to_parquet('D:/all_filings_april_2025_all_controls_alt.parquet', 
              engine='pyarrow',
              compression='snappy',  # Optional; snappy is already the pandas default
              use_dictionary=False,  # Turn off dictionary encoding
              write_statistics=False)  # Skip writing column statistics to the file metadata
print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-16 16:04:23
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 3575c7506796bdfbe0d5408b7e0b58077509c70001000000 Worker ID: 8f8a196deefeca60a1f8bd4d3cc8f53c58ebb04b988bb42c75de27fb Node ID: 240b123105390241729b6566d46242248cbc6c123f90e09733311551 Worker IP address: 127.0.0.1 Worker port: 53008 Worker PID: 14880 Worker exit type: SYSTEM_ERROR Worker exit detail: The leased worker has unrecoverable failure. Worker is requested to be destroyed when it is returned. RPC Error message: keepalive watchdog timeout; RPC Error details: 
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: e15760bace1c477db449c3fcf3d15242e975683401000000 Worker ID: 0b051d049ac1bb474f189c0acd93b5c8fcbfc1c5823264d47cdbc6f9 Node ID: 240b12310539024

In [None]:
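import pyarrow as pa
import pyarrow.parquet as pq  # pa.parquet is only available once this submodule is imported

# Convert to pandas if using modin
if 'modin.pandas' in str(type(df)):
    pd_df = df._to_pandas()
else:
    pd_df = df

# Identify object/string columns, then let PyArrow infer types for the rest
# (an empty slice is enough, since those types come from the pandas dtypes)
string_cols = {col for col in pd_df.columns
               if pd_df[col].dtype == 'object' or 'string' in str(pd_df[col].dtype)}
inferred = pa.Schema.from_pandas(pd_df.head(0).drop(columns=list(string_cols)),
                                 preserve_index=False)

# Create a schema where all object/string columns are explicitly string type
# (pa.field needs a concrete type, so None cannot be used as a placeholder)
schema = pa.schema([
    pa.field(col, pa.string()) if col in string_cols else inferred.field(col)
    for col in pd_df.columns
])

# Save with explicit schema
# Object columns holding non-string values (e.g. dicts) must be cast with astype(str) first,
# otherwise the conversion to pa.string() is expected to raise
table = pa.Table.from_pandas(pd_df, schema=schema, preserve_index=False)
pq.write_table(table, 'D:/all_filings_april_2025_all_controls_alt_v2.parquet')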

#### Save DF

In [87]:
%%time
import datetime
print("🕓 Save started:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
# Step 3: Save
#df_clean.to_parquet("D:/filings_full.parquet", engine="pyarrow", compression="snappy", index=False)
df.to_parquet("D:/all_filings_april_2025_all_controls.parquet", engine="pyarrow", compression="snappy", index=False)

print("✅ Save completed:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))

🕓 Save started: 2025-04-15 01:27:29
✅ Save completed: 2025-04-15 01:29:57
CPU times: total: 36.4 s
Wall time: 2min 27s
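The save cells in this section repeat the same start/end timestamp prints. As a sketch only, that pattern could be wrapped in a small context manager; the name `timed` is hypothetical and not something the notebook defines, and the short `time.sleep` call merely stands in for a real `df.to_parquet(...)` call.

```python
import datetime
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print start and end timestamps, plus elapsed seconds, around a block."""
    fmt = "%Y-%m-%d %H:%M:%S"
    print(f"{label} started:", datetime.datetime.now().strftime(fmt))
    start = time.perf_counter()
    result = {"elapsed": None}
    try:
        yield result
    finally:
        # Runs even if the wrapped block raises, so a failed save is still timed
        result["elapsed"] = time.perf_counter() - start
        print(f"{label} completed:", datetime.datetime.now().strftime(fmt),
              f"({result['elapsed']:.1f}s)")


# Stand-in for a slow save such as df.to_parquet(...)
with timed("Save") as t:
    time.sleep(0.1)
```

Unlike the `%%time` cell magic, this also works outside Jupyter and records the elapsed time in `t["elapsed"]` for later comparison across file formats.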


In [90]:
pwd

'C:\\Users\\Gregory\\IRS 990 Control Variables'

In [89]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_pickle('all_filings_april_2025_all_controls.pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

Current date and time :  2025-04-15 01:51:40 





# of columns: 496
# of observations: 3469008
CPU times: total: 52min 38s
Wall time: 57min 18s


Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,NameControl,Phone,USAddress,ForeignAddress,InCareOfName,BusinessName,BusinessNameControlTxt,PhoneNum,InCareOfNm,ForeignPhoneNum
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,X,MICHAEL ANTON,1473903,0,X,,RONA,8565826843,"{'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '18017'}",,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,,,266420,false,X,,TORR,7033415000,"{'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA', 'State': 'VA', 'ZIPCode': '22314'}",,,,,,,


In [91]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df.to_feather("D:/all_filings.feather")
#df = pd.read_feather("D:/all_filings.feather")

Current date and time :  2025-04-16 01:03:54 





CPU times: total: 12min 3s
Wall time: 13min 7s


# Verifications

In [18]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 500)

In [5]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


In [6]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')
df = pd.read_pickle('all_filings_april_2025_all_controls.pkl.gz', compression='gzip')
print('# of columns:', len(df.columns))
print('# of observations:', len(df))
df[:2]

Current date and time :  2025-04-17 15:38:57 

# of columns: 496
# of observations: 3469008
CPU times: total: 4min 56s
Wall time: 5min 45s


Unnamed: 0,_id,OrganizationName,URL,DLN,TaxPeriod,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,InfoInScheduleOPartIII,MissionDescription,SignificantNewProgramServices,SignificantChange,Expense,Grants,Description,TotalProgramServiceExpense,PoliticalActivities,LobbyingActivities,ProfessionalFundraising,FundraisingActivities,Gaming,ExcessBenefitTransaction,PriorExcessBenefitTransaction,DisregardedEntity,RelatedEntity,RelatedOrgControlledEntity,TransactionRelatedEntity,TransfersToExemptNonChrtblOrg,ActivitiesConductedPartnership,NumberFormsTransmittedWith1096,NumberOfEmployees,UnrelatedBusinessIncome,InfoInScheduleOPartVI,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,Local
Chapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,GrantsToDomesticOrgs,GrantsToDomesticIndividuals,FeesForServicesLegal,FeesForServicesAccounting,OfficeExpenses,PaymentsToAffiliates,DepreciationDepletion,OtherExpenses,AllOtherExpenses,TotalFunctionalExpenses,SavingsAndTempCashInvestments,AccountsReceivable,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasisNet,InvestmentsOtherSecurities,TotalAssets,AccountsPayableAccruedExpenses,GrantsPayable,OtherLiabilities,FollowSFAS117,UnrestrictedNetAssets,InfoInScheduleOPartXI,ReconcilationRevenueExpenses,InfoInScheduleOPartXII,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,Revenue,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,ForeignGrants,BenefitsToMembers,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,Advertising,InformationTechnology,Royalties,Occupancy,Travel,TravelEntrtnmntPublicOfficials,ConferencesMeetings,Interest,Insurance,CashNonInterestBearing,PledgesAndGrantsReceivable,ReceivablesFromDisqualPersons,OtherNotesLoansReceivableNet,InventoriesForSaleOrUse,PrepaidExpensesDeferredCharges,InvestmentsPubTradedSecurities,InvestmentsProgra
mRelated,IntangibleAssets,OtherAssetsTotal,DeferredRevenue,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,Activity2,Activity3,InfoInScheduleOPartVII,TaxExemptBondLiabilities,TemporarilyRestrictedNetAssets,OtherWebsite,PermanentlyRestrictedNetAssets,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,InfoInScheduleOPartV,OwnWebsite,UnsecuredNotesLoansPayable,ActivityOther,TotalOfOtherProgramServiceExp,TotalOfOtherProgramServiceRev,EscrowAccountLiability,TotalOfOtherProgramServiceGrnt,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TotalJointCosts,TerminatedReturn,TerminationOrContraction,ActivityCode,SpecialConditionDescription,Organization4947a1,InfoInScheduleOPartIX,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,InfoInScheduleOPartVIII,InfoInScheduleOPartX,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOY
Amt,InfoInScheduleOPartIIIInd,MissionDesc,SignificantNewProgramSrvcInd,SignificantChangeInd,Desc,PoliticalCampaignActyInd,LobbyingActivitiesInd,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,EngagedInExcessBenefitTransInd,PYExcessBenefitTransInd,DisregardedEntityInd,RelatedEntityInd,RelatedOrganizationCtrlEntInd,TransactionWithControlEntInd,TrnsfrExmptNonChrtblRltdOrgInd,ActivitiesConductedPrtshpInd,IRPDocumentCnt,EmployeeCnt,UnrelatedBusIncmOverLimitInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,OfficeExpensesGrp,InformationTechnologyGrp,ConferencesMeetingsGrp,InsuranceGrp,OtherExpensesGrp,AllOtherExpensesGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLess
ExpensesAmt,TotalLiabilitiesBOYAmt,ExpenseAmt,GrantAmt,RevenueAmt,ProgSrvcAccomActy2Grp,ProgSrvcAccomActy3Grp,ProgSrvcAccomActyOtherGrp,TotalOtherProgSrvcGrantAmt,TotalProgramServiceExpensesAmt,InfoInScheduleOPartVIInd,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,GrantsToDomesticIndividualsGrp,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,AdvertisingGrp,TravelGrp,InterestGrp,DepreciationDepletionGrp,SavingsAndTempCashInvstGrp,AccountsReceivableGrp,InventoriesForSaleOrUseGrp,PrepaidExpensesDefrdChargesGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,LandBldgEquipBasisNetGrp,InvestmentsOtherSecuritiesGrp,IntangibleAssetsGrp,AccountsPayableAccrExpnssGrp,DeferredRevenueGrp,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,UnrestrictedNetAssetsGrp,TemporarilyRstrNetAssetsGrp,InfoInScheduleOPartXIInd,NetUnrlzdGainsLossesInvstAmt,InfoInScheduleOPartXIIInd,AuditCommitteeInd,AllAffiliatesIncludedInd,GrantsToDomesticOrgsGrp,ForeignGrantsGrp,BenefitsToMembersGrp,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,RoyaltiesGrp,OccupancyGrp,PymtTravelEntrtnmntPubOfclGrp,PaymentsToAffiliatesGrp,PledgesAndGrantsReceivableGrp,RcvblFromDisqualifiedPrsnGrp,OthNotesLoansReceivableNetGrp,InvestmentsPubTradedSecGrp,InvestmentsProgramRelatedGrp,OtherAssetsTotalGrp,TotalOtherProgSrvcExpenseAmt,InfoInScheduleOPartVInd,MethodOfAccountingAccrualInd,NoncashContributionsAmt,GrantsPayableGrp,PermanentlyRstrNetAssetsGrp,TaxExemptBondLia
bilitiesGrp,EscrowAccountLiabilityGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,TotalOtherProgSrvcRevenueAmt,OwnWebsiteInd,TotalJointCostsGrp,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,InfoInScheduleOPartIXInd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,InfoInScheduleOPartXInd,GroupExemptionNum,InfoInScheduleOPartVIIInd,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,InfoInScheduleOPartVIIIInd,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,ActivityCd,Timestamp,TaxPeriodEndDate,TaxPeriodBeginDate,Officer,TaxYear,BuildTS,ReturnTs,TaxPeriodEndDt,TaxPeriodBeginDt,BusinessOfficerGrp,TaxYr,fiscal_year,EIN,Name,NameControl,Phone,USAddress,ForeignAddress,InCareOfName,BusinessName,BusinessNameControlTxt,PhoneNum,InCareOfNm,ForeignPhoneNum
0,5d019e6778ffca27b42818d7,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0.0,0,0.0,1044925.0,1439340,0,0,30447,33563,0.0,1000,1075372,1473903,638637.0,925000,0.0,0,0,0,0.0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,X,"THE CORPORATION IS ORGANIZED AND WILL BE OPERATED EXCLUSIVELY FOR CHARITABLE, EDUCATIONAL AND SCIENTIFIC PURPOSES WITHIN THE MEANING OF SECTION 501(C)(3) OF THE INTERNAL REVENUE CODE. SUCH PURPOSES SHALL BE LIMITED TO PROVIDING SUPPORT AND FUNDIN...",0,0,1043744,925000.0,"RMHC OF THE PHILADELPHIA REGION, INC. GRANTS HUNDREDS OF THOUSANDS OF DOLLARS PER YEAR TO SUPPORT NON-PROFIT PROGRAMS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN. LOCALLY, RMHC SUPPORTS THE PHILADELPHIA, SOUTHERN NEW JERSEY AND DE...",1043744,"""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""",0,0,0,X,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[""PA"", ""NJ"", ""DE""]",X,X,0.0,0,0,0,0,0,0,0,1439340.0,1439340,1000,"{""TotalRevenueColumn"": ""1473903"", ""RelatedOrExemptFunctionIncome"": ""1000"", ""UnrelatedBusinessRevenue"": ""0"", ""ExclusionAmount"": ""33563""}","{""Total"": ""892000"", ""ProgramServices"": ""892000""}","{""Total"": ""33000"", ""ProgramServices"": ""33000""}","{""Total"": ""215"", ""ManagementAndGeneral"": ""215""}","{""Total"": ""21675"", ""ManagementAndGeneral"": ""21675""}","{""Total"": ""123"", ""ManagementAndGeneral"": ""123""}","{""Total"": ""118744"", ""ProgramServices"": ""118744""}","{""Total"": ""86228"", ""ManagementAndGeneral"": ""86228""}","[{""Description"": ""FUNDRAISING COSTS"", ""Total"": ""108311"", ""Fundraising"": ""108311""}, {""Description"": ""CANISTER COLLECTION FEE"", ""Total"": 
""81925"", ""Fundraising"": ""81925""}, {""Description"": ""PR/ADMINISTRATIVE SERVI"", ""Total"": ""34517"", ""ManagementAndGe...","{""Total"": ""763"", ""ManagementAndGeneral"": ""763""}","{""Total"": ""1384751"", ""ProgramServices"": ""1043744"", ""ManagementAndGeneral"": ""145115"", ""Fundraising"": ""195892""}","{""BOY"": ""332660"", ""EOY"": ""270700""}","{""BOY"": ""103412"", ""EOY"": ""147981""}",256845,86228,"{""BOY"": ""0"", ""EOY"": ""170617""}","{""BOY"": ""1489143"", ""EOY"": ""1851561""}","{""BOY"": ""1925215"", ""EOY"": ""2440859""}","{""BOY"": ""39670"", ""EOY"": ""44353""}","{""BOY"": ""80500"", ""EOY"": ""166000""}","{""BOY"": ""51640"", ""EOY"": ""240077""}",X,"{""BOY"": ""1753405"", ""EOY"": ""1990429""}",X,89152,X,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011-11-09T06:41:09-06:00,2010-12-31,2010-01-01,"{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",2010,2016-02-24 21:20:13Z,,,,,,,232705170,"{'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}",RONA,8565826843,"{'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '18017'}",,,,,,,
1,5d019e6778ffca27b42818d8,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,,,266420,false,X,,X,1993,WY,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,19,13,0,,0,,,0,222839,265592,1425,828,,0,224264,266420,,0,,0,71405,82955,,0,0,189785,222550,261190,305505,-36926,-39085,1455332,1433342,17482,34577,1437850,1398765,,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,false,false,276405,,PROVIDE HOUSING FOR THE ELDERLY AND THE DISABLED UNDER SECTION 202 OF THE NATIONAL HOUSING ACT UNDER AN AGREEMENT WITH THE DEPARTMENT OF HUD.,276405,"""false""","""false""","""false""","""false""","""false""","""false""","""false""","{""@referenceDocumentId"": "" IRS990ScheduleR"", ""#text"": ""true""}","{""@referenceDocumentId"": "" IRS990ScheduleR"", ""#text"": ""true""}","""false""","{""@referenceDocumentId"": "" IRS990ScheduleR"", ""#text"": ""false""}","{""@referenceDocumentId"": "" IRS990ScheduleR"", ""#text"": ""false""}","{""@referenceDocumentId"": "" IRS990ScheduleR"", ""#text"": ""false""}",0,0,false,X,19,13,true,true,false,false,false,true,true,true,true,false,false,true,true,true,true,false,false,true,true,false,,X,,,1180355,411648,0,true,true,false,0,,0,0,"{""TotalRevenueColumn"": ""266420"", ""RelatedOrExemptFunctionIncome"": ""266420""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""7500"", ""ManagementAndGeneral"": ""7500""}","{""Total"": ""14222"", ""ProgramServices"": ""14222""}","{""Total"": ""0""}","{""Total"": ""66166"", ""ProgramServices"": ""66166""}","[{""Description"": ""OPER. 
& MAINT."", ""Total"": ""46164"", ""ProgramServices"": ""46164""}, {""Description"": ""MISC TAXES"", ""Total"": ""298"", ""ProgramServices"": ""298""}, {""Description"": ""ADMINISTRATIVE"", ""Total"": ""12176"", ""ProgramServices"": ""12176""}]","{""Total"": ""0""}","{""Total"": ""305505"", ""ProgramServices"": ""276405"", ""ManagementAndGeneral"": ""29100"", ""Fundraising"": ""0""}","{""EOY"": ""0""}","{""BOY"": ""231"", ""EOY"": ""474""}",2187206,904332,"{""BOY"": ""1306860"", ""EOY"": ""1282874""}","{""BOY"": ""125980"", ""EOY"": ""102794""}","{""BOY"": ""1455332"", ""EOY"": ""1433342""}","{""BOY"": ""2040"", ""EOY"": ""16145""}",,"{""BOY"": ""9203"", ""EOY"": ""11349""}",X,"{""BOY"": ""1437850"", ""EOY"": ""1398765""}",,-39085,,X,false,true,true,true,"""false""",1736.0,266420.0,False,False,265592.0,"{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""59440"", ""ProgramServices"": ""59440""}","{""Total"": ""0""}","{""Total"": ""17714"", ""ProgramServices"": ""17714""}","{""Total"": ""5801"", ""ProgramServices"": ""5801""}","{""Total"": ""21600"", ""ManagementAndGeneral"": ""21600""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""8433"", ""ProgramServices"": ""8433""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""44077"", ""ProgramServices"": ""44077""}","{""Total"": ""0""}","{""Total"": ""0""}","{""Total"": ""806"", ""ProgramServices"": ""806""}","{""Total"": ""0""}","{""Total"": ""1108"", ""ProgramServices"": ""1108""}","{""BOY"": ""250"", ""EOY"": ""22261""}","{""EOY"": ""0""}","{""EOY"": ""0""}","{""EOY"": ""0""}","{""EOY"": ""0""}","{""BOY"": ""7628"", ""EOY"": ""7554""}","{""EOY"": ""0""}","{""EOY"": ""0""}","{""EOY"": ""0""}","{""BOY"": ""14383"", ""EOY"": ""17385""}","{""BOY"": ""20"", ""EOY"": ""48""}","{""BOY"": ""6219"", ""EOY"": 
""7035""}",True,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2011-11-09T07:32:06-08:00,2011-06-30,2010-07-01,"{'Name': 'THOMAS D TURNBULL', 'Title': 'ASST. SEC/TREAS', 'DateSigned': '2011-11-09'}",2010,2016-02-24 21:20:13Z,,,,,,,581805618,"{'BusinessNameLine1': 'TORRINGTON VOA ELDERLY HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}",TORR,7033415000,"{'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA', 'State': 'VA', 'ZIPCode': '22314'}",,,,,,,


In [8]:
df[['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp', 
    'Activity3', 'ProgSrvcAccomActy3Grp',
    'ProgSrvcAccomActy2Grp', 'Activity2']][:2]

Unnamed: 0,LoansFromOfficersDirectors,LoansFromOfficersDirectorsGrp,Activity3,ProgSrvcAccomActy3Grp,ProgSrvcAccomActy2Grp,Activity2
0,,,,,,
1,,,,,,


In [9]:
df[['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp', 
    'Activity3', 'ProgSrvcAccomActy3Grp',
    'ProgSrvcAccomActy2Grp', 'Activity2']].isna().sum()

LoansFromOfficersDirectors       2639008
LoansFromOfficersDirectorsGrp     450000
Activity3                        2609008
ProgSrvcAccomActy3Grp             450000
ProgSrvcAccomActy2Grp             450000
Activity2                        2609008
dtype: int64

In [10]:
df[['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp', 
    'Activity3', 'ProgSrvcAccomActy3Grp',
    'ProgSrvcAccomActy2Grp', 'Activity2']].count()

LoansFromOfficersDirectors        830000
LoansFromOfficersDirectorsGrp    3019008
Activity3                         860000
ProgSrvcAccomActy3Grp            3019008
ProgSrvcAccomActy2Grp            3019008
Activity2                         860000
dtype: int64
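As a quick consistency check, the missing and non-missing counts printed above should add up to the total number of observations for every column. The sketch below simply re-keys the printed numbers by hand; the variable names are illustrative only.

```python
total_rows = 3469008  # '# of observations' printed when the pickle was loaded

# Values copied from the .isna().sum() output above
missing = {
    "LoansFromOfficersDirectors": 2639008,
    "LoansFromOfficersDirectorsGrp": 450000,
    "Activity3": 2609008,
    "ProgSrvcAccomActy3Grp": 450000,
    "ProgSrvcAccomActy2Grp": 450000,
    "Activity2": 2609008,
}

# Values copied from the .count() output above
non_missing = {
    "LoansFromOfficersDirectors": 830000,
    "LoansFromOfficersDirectorsGrp": 3019008,
    "Activity3": 860000,
    "ProgSrvcAccomActy3Grp": 3019008,
    "ProgSrvcAccomActy2Grp": 3019008,
    "Activity2": 860000,
}

for col in missing:
    # isna().sum() and count() partition the rows, so they must sum to the total
    assert missing[col] + non_missing[col] == total_rows, col
    print(f"{col:32s} {non_missing[col] / total_rows:6.1%} populated")
```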

In [25]:
import gc
gc.collect()

492

In [26]:
df[df['LoansFromOfficersDirectors'].notnull()][['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp', 
    'Activity3', 'ProgSrvcAccomActy3Grp',
    'ProgSrvcAccomActy2Grp', 'Activity2']].sample(5)

Unnamed: 0,LoansFromOfficersDirectors,LoansFromOfficersDirectorsGrp,Activity3,ProgSrvcAccomActy3Grp,ProgSrvcAccomActy2Grp,Activity2
59565,,,"{""Expense"": ""408333"", ""Grants"": ""408333"", ""Description"": ""EDUCATION AND TRAINING- ARE CRITICAL TO MAINTAINING THE SCMCI'S ABILITY TO DELIVER THE BEST PEDIATRIC CARE. PARTNERING WITH DOCTORS FROM ALL OVER THE WORLD, WE BRING THE BEST AND BRIGHTEST PROFESSIONALS, TO TEACH NEW MEDICAL PROCEDURES, PROTOCOLS AND TECHNIQUES. SCMCI IN TURN, TRAINS RESIDENTS FROM EVERY CONTINENT,AT THE HOSPITAL AND ON LOCATION,EXPANDING THE UNDERSTANDING OF PEDIATRIC PRACTICES ALL OVER THE WORLD.""}",,,"{""Expense"": ""233924"", ""Grants"": ""233924"", ""Description"": ""MEDICAL RESEARCH- SCMCI PERFORMS RESEARCH IN ALL ITS DEPARTMENTS AND SOME OF THE RESEARCH AREAS FUNDED BY MDI INCLUDE: DIABETES,THE IMPACT OF NUTRITION ON CROHN'S DISEASE AND COLITIS,AND STEM CELL THERAPY AS A SUBSTITUTE FOR CHEMOTHERAPY TREATMENT OF SPECIFIC CHILDHOOD CANCERS.MDI ALSO FUNDS MULTIPLE PROGRAMS,SUCH AS ANXIETY DISORDERS CLINIC, ART AND MUSIC THERAPY PROGRAMS, BURN CENTER, CHILD DEVELOPMENT, CRISIS INTERVENTION, ENDOCRIN..."
86826,,,,,,
573856,,,,"{""ExpenseAmt"": ""112402"", ""RevenueAmt"": ""133101"", ""Desc"": ""The basketball program for boys and girls uses funds to pay for uniforms, field maintenance, referees, clinics, coaches fees and tournament costs.""}","{""ExpenseAmt"": ""306582"", ""RevenueAmt"": ""440271"", ""Desc"": ""The Hockey program for boys and girls supports and develops are youth players. Support includes ice time rental, equipment, training, tournament costs, clinics and insurance.""}",
187866,,,"{""Expense"": ""44205"", ""Revenue"": ""82792"", ""Description"": ""SEE SCHEDULE O""}",,,"{""Expense"": ""45814"", ""Revenue"": ""1787397"", ""Description"": ""SEE SCHEDULE O""}"
734551,,,,,,
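
The populated cells in this sample look like JSON strings, and the key names differ between the two naming schemes (`Expense` versus `ExpenseAmt`, `Description` versus `Desc`). A minimal sketch of parsing such cells and pulling out the expense figure under either spelling, assuming the cells are stored as JSON text rather than as Python dicts. The helpers `parse_cell` and `expense_amount` are hypothetical names, and the toy series reuses two abbreviated values from the output above:

```python
import json
import pandas as pd

def parse_cell(value):
    """Return a dict for a JSON-object string, or None for anything else."""
    if not isinstance(value, str) or not value.strip():
        return None
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

def expense_amount(value):
    """Read the expense figure under either key spelling and return a float."""
    parsed = parse_cell(value)
    if parsed is None:
        return None
    raw = parsed.get("ExpenseAmt", parsed.get("Expense"))
    return float(raw) if raw is not None else None

# Toy series: one value per key spelling, plus a missing cell.
toy = pd.Series([
    '{"Expense": "44205", "Revenue": "82792", "Description": "SEE SCHEDULE O"}',
    '{"ExpenseAmt": "112402", "RevenueAmt": "133101", "Desc": "(abbreviated)"}',
    None,
])
expenses = toy.map(expense_amount)
print(expenses)
```

Cells that are missing or not valid JSON come back as missing values, so the result can be compared across both column families without a separate cleaning step.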


In [15]:
# Sample five rows where LoansFromOfficersDirectorsGrp is non-null.
df.loc[df['LoansFromOfficersDirectorsGrp'].notna(),
       ['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp',
        'Activity3', 'ProgSrvcAccomActy3Grp',
        'ProgSrvcAccomActy2Grp', 'Activity2']].sample(5)

Unnamed: 0,LoansFromOfficersDirectors,LoansFromOfficersDirectorsGrp,Activity3,ProgSrvcAccomActy3Grp,ProgSrvcAccomActy2Grp,Activity2
1443876,,,,,,
1806745,,,,,,
581918,,"{""BOYAmt"": ""149000"", ""EOYAmt"": ""177356""}",,,,
632595,,"{""BOYAmt"": ""0"", ""EOYAmt"": ""0""}",,,,
2832879,,,,,,
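
Three of the five rows sampled here still display an empty `LoansFromOfficersDirectorsGrp` even though the filter keeps only non-null rows. One possible explanation, which is an assumption and not something verified in this notebook, is that some cells hold placeholder text (an empty string or a literal `'nan'`) that `notnull()` treats as a real value. A minimal sketch of a stricter check; `PLACEHOLDERS` and `is_effectively_missing` are hypothetical names, and the toy series is made up:

```python
import pandas as pd

# Strings treated as "missing" in this sketch; adjust after inspecting the real data.
PLACEHOLDERS = {"", "nan", "none"}

def is_effectively_missing(series: pd.Series) -> pd.Series:
    """True where a cell is NaN/None or a placeholder string such as '' or 'nan'."""
    is_placeholder = series.map(
        lambda v: isinstance(v, str) and v.strip().lower() in PLACEHOLDERS
    ).astype(bool)
    return series.isna() | is_placeholder

# Toy series: one real value, two placeholder strings, two true missing values.
toy = pd.Series(['{"BOYAmt": "0", "EOYAmt": "0"}', "", "nan", None, float("nan")])
plain_non_null = int(toy.notna().sum())
strict_non_null = int((~is_effectively_missing(toy)).sum())
print(plain_non_null, strict_non_null)
```

If the two counts differ on the real column, the filter could use `~is_effectively_missing(df['LoansFromOfficersDirectorsGrp'])` in place of the `notnull()` mask.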


In [21]:
# Sample five rows where Activity3 is non-null.
df.loc[df['Activity3'].notna(),
       ['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp',
        'Activity3', 'ProgSrvcAccomActy3Grp',
        'ProgSrvcAccomActy2Grp', 'Activity2']].sample(5)

Unnamed: 0,LoansFromOfficersDirectors,LoansFromOfficersDirectorsGrp,Activity3,ProgSrvcAccomActy3Grp,ProgSrvcAccomActy2Grp,Activity2
803725,,,,,"{""ExpenseAmt"": ""18416"", ""Desc"": ""GRRAND ALSO EDUCATES THE PUBLIC CONCERNING THE GOLDEN RETRIEVER AND ITS NEEDS AND ATTRIBUTES AND PROPER PET HEALTHCARE.""}",
424525,,,,,,
585728,,,,"{""ExpenseAmt"": ""43989"", ""GrantAmt"": ""17360"", ""Desc"": ""WE PROVIDED GROCERY STORE GIFT CARDS ON A MONTHLY BASIS FOR SEVERAL CHARITIES IN THE CHARLOTTESVILLE AREA WHOSE FOOD PROGRAMS INCLUDE THE PROVISION OF FOOD FOR NEEDY PEOPLE.""}","{""ExpenseAmt"": ""148843"", ""GrantAmt"": ""31140"", ""Desc"": ""WE PROVIDED SUPPORT OF LOCAL ORGANIZATIONS THAT PROVIDE SERVICES TO HOMELESS MEN AND WOMEN.""}",
299082,,,,,,
256863,,,,,,


In [16]:
# Sample five rows where ProgSrvcAccomActy3Grp is non-null.
df.loc[df['ProgSrvcAccomActy3Grp'].notna(),
       ['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp',
        'Activity3', 'ProgSrvcAccomActy3Grp',
        'ProgSrvcAccomActy2Grp', 'Activity2']].sample(5)

Unnamed: 0,LoansFromOfficersDirectors,LoansFromOfficersDirectorsGrp,Activity3,ProgSrvcAccomActy3Grp,ProgSrvcAccomActy2Grp,Activity2
1577430,,"{""BOYAmt"": ""0"", ""EOYAmt"": ""0""}",,"{""ExpenseAmt"": ""0"", ""GrantAmt"": ""0"", ""RevenueAmt"": ""0"", ""Desc"": ""Fences For Fido actively supports and mentors new unchaining groups all over the country. This support can be both advisory and financial. We also support animal rights laws, especi...","{""ExpenseAmt"": ""46045"", ""GrantAmt"": ""0"", ""RevenueAmt"": ""0"", ""Desc"": ""This program helps provide for the basic needs of the dogs, including food, flea treatment, dog beds, and even toys. The main expenses of this program are spay/neuter services, ...",
1877352,,,,,"{""ExpenseAmt"": ""131526"", ""GrantAmt"": ""0"", ""RevenueAmt"": ""0"", ""Desc"": ""THE ADVOCACY PROJECT FOCUSES ON SYSTEMATIC RESPONSES FOR BEST MEETING THE NEEDS OF INDIAN CHILDREN AND FAMILIES CONSISTENT WITH THE INDIAN CHILD WELFARE ACT. THE ICWA LAW CENTE...",
1721593,,,,,,
2433324,,,,,,
1657610,,,,,,


In [22]:
# Sample five rows where Activity2 is non-null.
df.loc[df['Activity2'].notna(),
       ['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp',
        'Activity3', 'ProgSrvcAccomActy3Grp',
        'ProgSrvcAccomActy2Grp', 'Activity2']].sample(5)

Unnamed: 0,LoansFromOfficersDirectors,LoansFromOfficersDirectorsGrp,Activity3,ProgSrvcAccomActy3Grp,ProgSrvcAccomActy2Grp,Activity2
826465,,,,,,
749510,,,,"{""ExpenseAmt"": ""79041"", ""Desc"": ""COURT ADVOCATE PROGRAM PROVIDING ASSISTANCE THROUGH THE LEGAL AND COURT PROCESS BY QUALIFIED STAFF.""}","{""ExpenseAmt"": ""91846"", ""Desc"": ""SCHOOL EDUCATION AND ONGOING PUBLIC AWARENESS CAMPAIGNS IN THE COMMUNITY REGARDING THE PROBLEMS OF DOMESTIC VIOLENCE.""}",
255519,,,,,,
608172,,,,,,
802111,,,,,,


In [23]:
# Sample five rows where ProgSrvcAccomActy2Grp is non-null.
df.loc[df['ProgSrvcAccomActy2Grp'].notna(),
       ['LoansFromOfficersDirectors', 'LoansFromOfficersDirectorsGrp',
        'Activity3', 'ProgSrvcAccomActy3Grp',
        'ProgSrvcAccomActy2Grp', 'Activity2']].sample(5)

KeyboardInterrupt: 
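
The cell above was interrupted before it finished. Filtering the full frame with a boolean mask builds a copy of every column for all matching rows before the six columns are selected, which may be part of why it is slow on a frame with millions of rows. A minimal sketch of a cheaper route that picks the row positions first and only then touches the frame; `sample_non_null` is a hypothetical helper and the toy frame is made up:

```python
import numpy as np
import pandas as pd

def sample_non_null(frame, filter_col, cols, n=5, seed=None):
    """Sample up to n rows where filter_col is non-null, returning only cols."""
    # Integer positions of the non-null rows; this reads a single column.
    positions = np.flatnonzero(frame[filter_col].notna().to_numpy())
    if positions.size == 0:
        return frame.iloc[[]][cols]
    rng = np.random.default_rng(seed)
    chosen = rng.choice(positions, size=min(n, positions.size), replace=False)
    # Take the few chosen rows first, then keep the requested columns.
    return frame.iloc[np.sort(chosen)][cols]

# Toy frame standing in for the real `df`.
toy = pd.DataFrame({
    "a": ["x", None, "y", None, "z", "w", None, "v", None, "u"],
    "b": range(10),
    "c": range(10, 20),
})
out = sample_non_null(toy, "a", ["a", "b"], n=3, seed=0)
print(out)
```

On the real data the equivalent call would be `sample_non_null(df, 'ProgSrvcAccomActy2Grp', cols)` with `cols` set to the six column names used throughout this section.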