# Overview
In this notebook I read in a concordance file that has **_all_** reconciled variables to date
- The file is called *concordance_VERIFIED.xlsx*

# TO DO:
- In second notebook simply combine and rename
- In third notebook binarize all relevant variables
    - Add a column to concordance file that notes all 'binarize' variables
- Rationalize and/or combine/modify remaining notebooks as needed in order to generate variables

# Load Packages and Connect to MongoDB

In [2]:
import sys
import time
import json

In [3]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

In [4]:
print(pd.__version__)

1.1.5


In [5]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)

#### Set working directory

In [6]:
#cd '/Users/gsaxton/Dropbox/990 e-file data'

In [7]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


#### MongoDB
Depending on the project, I will store the data in SQLite or MongoDB. This time I'll use MongoDB -- it's great for storing JSON data where each observation could have different variables. Before we get to the interesting part the following code blocks set up the MongoDB environment and the new database we'll be using. 

**_Note:_** In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB.

In [8]:
import pymongo
from pymongo import MongoClient
client = MongoClient()

In [9]:
print(pymongo.__version__)

3.9.0


In [10]:
MongoClient().list_database_names()

['ICIJ',
 'OWS',
 'Panama',
 'SMC',
 'admin',
 'cashtags',
 'config',
 'irs_990_db',
 'local',
 'paradisepapers',
 'sp500',
 'test']

##### This first database we'll define is for storing the File LISTINGS information we've generated above.

In [11]:
# DEFINE MY mongoDB DATABASE
db = client['irs_990_db']

# DEFINE MY COLLECTION HOUSING 990 DATA
filings_990 = db['filings_990']

In [11]:
MongoClient().list_database_names()

['ICIJ',
 'OWS',
 'Panama',
 'SMC',
 'admin',
 'cashtags',
 'config',
 'irs_990_db',
 'local',
 'paradisepapers',
 'sp500',
 'test']

In [12]:
client.list_database_names()

['ICIJ',
 'OWS',
 'Panama',
 'SMC',
 'admin',
 'cashtags',
 'config',
 'irs_990_db',
 'local',
 'paradisepapers',
 'sp500',
 'test']

<br>When we set up our database in an earlier tutorial, we set a unique constraint on the collection based on *URL*. This averted duplicates from being inserted.

DuplicateKeyError: E11000 duplicate key error collection: irs_990_db.filings_990 index: URL_1 dup key: { : "https://s3.amazonaws.com/irs-form-990/201100129349301055_public.xml" }

db.getCollection('filings_990').find({'URL' : "https://s3.amazonaws.com/irs-form-990/201100129349301055_public.xml" })

In [14]:
#OLD CODE
#db.getCollection('filings_990').find({'URL' : "https://s3.amazonaws.com/irs-form-990/201100259349301110_public.xml" })

In [13]:
db.filings_990.find_one({'URL' : "https://s3.amazonaws.com/irs-form-990/201100289349300910_public.xml" })

{'_id': ObjectId('5d01cfed78ffca27b428aa97'),
 'OrganizationName': 'ASSEMBLEIA DE DEUS MINISTERIO BELEM CHUR',
 'ObjectId': '201100289349300910',
 'URL': 'https://s3.amazonaws.com/irs-form-990/201100289349300910_public.xml',
 'SubmittedOn': '2011-09-22',
 'DLN': '93493028009101',
 'LastUpdated': '2016-03-21T17:23:53',
 'TaxPeriod': '201012',
 'FormType': '990',
 'EIN': '954745380',
 '@xmlns': 'http://www.irs.gov/efile',
 '@returnVersion': '2010v3.2',
 'ReturnHeader': {'@binaryAttachmentCount': '0',
  'Timestamp': '2011-01-28T13:07:07-08:00',
  'TaxPeriodEndDate': '2010-12-31',
  'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'VIRULAS GENERAL OFFICE'},
   'PreparerFirmUSAddress': {'AddressLine1': '4138 ATLANTIC AVE',
    'City': 'Long Beach',
    'State': 'CA',
    'ZIPCode': '90807'}},
  'ReturnType': '990',
  'TaxPeriodBeginDate': '2010-01-01',
  'Filer': {'EIN': '954745380',
   'Name': {'BusinessNameLine1': 'ASSEMBLEIA DE DEUS MINISTERIO BELEM CHUR'},
   'NameCont

In [16]:
#db.filings_990.create_index([('URL', pymongo.ASCENDING)], unique=True)

In [12]:
list(db.filings_990.index_information())

['_id_', 'URL_1']

<br>Check how many observations in the database table.

NOTE: in the *v1* file there are ~70,000 duplicates_v1


In [18]:
1617904 - 1587557

30347

In [13]:
filings_990.estimated_document_count()

1895016

In [20]:
#import timeit
#start_time = timeit.default_timer()
#df = pd.DataFrame(list(filings_990.find({}, {'URL':1, 
#    '_id':0})))
#elapsed = timeit.default_timer() - start_time
#print('# of minutes: ', elapsed/60)
#print("Number of columns:", len(df.columns))
#print("Number of observations:", len(df))
#df[:1]

In [21]:
#duplicateRowsDF = df[df.duplicated(['URL'])]
#print("Number of columns:", len(duplicateRowsDF.columns))
#print("Number of observations:", len(duplicateRowsDF))

In [22]:
#df['test'] = 1

In [23]:
#dups = df.groupby('URL').count()
#dups[:1]

In [24]:
#dups['test'].value_counts()

In [16]:
filings_990.estimated_document_count()

1895016

# Read in Concordance File
We are going to read in two codebooks. First, there is the 'concordance' file. Specifically, before re-arranging and renaming variables, we will read in the relevant section from the *master concordance* file, and then use this file to identify the relevant 'compensation' variables. In a following notebook, we will be using the *new_variable_name* field as our variable name.

In [14]:
concordance = pd.read_excel('concordance_VERIFIED.xlsx')
print('# of columns:', len(concordance.columns))
print('# of observations:', len(concordance))
concordance[:2]

# of columns: 17
# of observations: 384


Unnamed: 0,xpath,project,variable_name_new,# of Characters (newly named),variable name notes,PARSING NOTES,OTHER NOTES,description,location_code,part,data_type_xsd,python_data_type,fill_null,BINARIZE,MongoDB_Name,sub_key,sub_sub_key
0,/Return/ReturnHeader/TaxPeriodEndDate,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,string,Do not fill null,,TaxPeriodEndDate,,
1,/Return/ReturnHeader/TaxPeriodEndDt,,F9_00_HD_TAX_PER_END,,,Will be nested under ReturnHeader,,Tax period end date,HEADER,HD,DateType,string,Do not fill null,,TaxPeriodEndDt,,


<br>Check MongoDB_Name

In [15]:
print(len(concordance['MongoDB_Name'].tolist()))
print(len(set(concordance['MongoDB_Name'].tolist())))

384
328


In [16]:
mongo_cols = concordance[:]['MongoDB_Name'].tolist()
print(len(mongo_cols))
print(len(set(mongo_cols)))
mongo_cols = list(set(mongo_cols))
print(len(mongo_cols))
print(mongo_cols[:5])

384
328
328
['OtherSalariesAndWages', 'ReconcilationDonatedServices', 'RtnEarnEndowmentIncmOthFndsGrp', 'NbrVotingMembersGoverningBody', 'CYOtherExpensesAmt']


# Extract Data from MongoDB Databse

Print out a sorted list of our desired columns.

In [17]:
mongo_cols = [x for x in mongo_cols if str(x) != 'nan']
print(len(mongo_cols))

328


In [18]:
print(len(sorted(mongo_cols)))

328


<br>Use 'helper' loop to print out variables for MongoDB -- we'll copy and paste this in our next block of code

In [19]:
for c in sorted(mongo_cols):
    print("    '"+c+"'"+': 1, ')

    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidToMembersCY': 1, 
    'BenefitsPaidToMembersPriorYear': 1, 
    'BuildTS': 1, 
    'BusinessOfficerGrp': 1, 
    'CYBenefitsPaidToMembersAmt': 1, 
    'CYContributionsGrantsAmt': 1, 
    'CYGrantsAndSimilarPaidAmt': 1, 
    'CYInvestmentIncomeAmt': 1, 
    'CYOtherExpensesAmt': 1, 
    'CYOtherRevenueAmt': 1, 
    'CYProgramServiceRevenueAmt': 1, 
    'CYRevenuesLessExpensesAmt': 1, 
    'CYSalariesCompEmpBnftPaidAmt': 1, 
    'CYTotalExpensesAmt

### View one row of data from our database
Paste variables from above into code block below then run to view first row of data. Note that we include four identifier columns (*EIN, OrganizationName, DLN*, and *URL*). We also include *_id* (a MongoDB column) with a '0' tag, meaning we don't want this otherwise automatically included column.

In [14]:
for f in filings_990.find({})[:1]:
    print(sorted(f.keys()))

['@documentCount', '@documentId', '@referenceDocumentId', '@returnVersion', '@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', 'AccountantCompileOrReview', 'AccountsPayableAccruedExpenses', 'AccountsReceivable', 'ActivitiesConductedPartnership', 'ActivityOrMissionDescription', 'AddressChange', 'AddressPrincipalOfficerUS', 'AllOtherContributions', 'AllOtherExpenses', 'AnnualDisclosureCoveredPersons', 'AuditCommittee', 'BenefitsPaidToMembersCY', 'BenefitsPaidToMembersPriorYear', 'BsnssRltnshpThruFamilyMember', 'BsnssRltnshpWithOrganization', 'ChangesToOrganizingDocs', 'CollectionsOfArt', 'CompensationFromOtherSources', 'CompensationProcessCEO', 'CompensationProcessOther', 'ComplianceWithBackupWitholding', 'ConflictOfInterestPolicy', 'ConservationEasements', 'ConsolidatedAuditFinancialStmt', 'ContributionsGrantsCurrentYear', 'ContributionsGrantsPriorYear', 'CreditCounseling', 'DLN', 'DecisionsSubjectToApproval', 'DeductibleContributionsOfArt', 'DeductibleNonCashContributions', 'DelegationOfMa

In [24]:
df = pd.DataFrame(list(filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1, 
    'TaxPeriod': 1, #'TaxYr': 1, 'TaxPeriodEndDate': 1, 'TaxPeriodEndDt': 1, 
    'ReturnHeader.TaxYear': 1, 'ReturnHeader.TaxYr': 1,
    #'ReturnHeader': 1,
    'ReturnHeader.TaxPeriodEndDate': 1, 'ReturnHeader.TaxPeriodEndDt': 1,                                       
    'AccountantCompileOrReview': 1, 
    'YearFormation': 1})[:10]))
df

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,YearFormation,AccountantCompileOrReview
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1992,0
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",1993,false
2,HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR,https://s3.amazonaws.com/irs-form-990/201113139349301316_public.xml,93493313013161,201106,581876019,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",1990,false
3,FAITH-HAVEN CORPORATION,https://s3.amazonaws.com/irs-form-990/201113139349301321_public.xml,93493313013211,201106,391083432,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",1965,true
4,OCHSNER COMMUNITY HOSPITALS,https://s3.amazonaws.com/irs-form-990/201113139349301326_public.xml,93493313013261,201012,205297040,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",2006,0
5,HOUSTON HOUSE FOUNDATION,https://s3.amazonaws.com/irs-form-990/201113139349301331_public.xml,93493313013311,201012,760314047,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1990,false
6,ADOPTION PLANNING INC,https://s3.amazonaws.com/irs-form-990/201113139349301336_public.xml,93493313013361,201012,581916251,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1989,false
7,CHAMPAIGN RESIDENTIAL SERVICES INC,https://s3.amazonaws.com/irs-form-990/201113139349301346_public.xml,93493313013461,201012,341200331,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1977,true
8,GLENOAKS HOSPITAL FOUNDATION,https://s3.amazonaws.com/irs-form-990/201113139349301406_public.xml,93493313014061,201012,363926044,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1992,0
9,ST PAULS CLINIC FOUNDATION INC,https://s3.amazonaws.com/irs-form-990/201113139349301431_public.xml,93493313014311,201106,202752128,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",2006,0


In [25]:
df['TAXYEAR'] = df['ReturnHeader'].apply(lambda x: x['TaxYear'])

In [26]:
df 

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,YearFormation,AccountantCompileOrReview,TAXYEAR
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1992,0,2010
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",1993,false,2010
2,HOUSTON VOA INDEPENDENT HOUSING INC HEIGHTS MANOR,https://s3.amazonaws.com/irs-form-990/201113139349301316_public.xml,93493313013161,201106,581876019,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",1990,false,2010
3,FAITH-HAVEN CORPORATION,https://s3.amazonaws.com/irs-form-990/201113139349301321_public.xml,93493313013211,201106,391083432,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",1965,true,2010
4,OCHSNER COMMUNITY HOSPITALS,https://s3.amazonaws.com/irs-form-990/201113139349301326_public.xml,93493313013261,201012,205297040,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",2006,0,2010
5,HOUSTON HOUSE FOUNDATION,https://s3.amazonaws.com/irs-form-990/201113139349301331_public.xml,93493313013311,201012,760314047,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1990,false,2010
6,ADOPTION PLANNING INC,https://s3.amazonaws.com/irs-form-990/201113139349301336_public.xml,93493313013361,201012,581916251,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1989,false,2010
7,CHAMPAIGN RESIDENTIAL SERVICES INC,https://s3.amazonaws.com/irs-form-990/201113139349301346_public.xml,93493313013461,201012,341200331,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1977,true,2010
8,GLENOAKS HOSPITAL FOUNDATION,https://s3.amazonaws.com/irs-form-990/201113139349301406_public.xml,93493313014061,201012,363926044,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",1992,0,2010
9,ST PAULS CLINIC FOUNDATION INC,https://s3.amazonaws.com/irs-form-990/201113139349301431_public.xml,93493313014311,201106,202752128,"{'TaxPeriodEndDate': '2011-06-30', 'TaxYear': '2010'}",2006,0,2010


# Read 990 DB into PANDAS DF
We can modify the above code block to read all filings into a PANDAS dataframe.

#### Iterator approach for large datasets
http://deo.im/2016/09/22/Load-data-from-mongodb-to-Pandas-DataFrame/

n.b. - I added 'sort=False' in the last row in order to turn off future warning. I may need to add this to the append function as well

In [20]:
def iterator2dataframe(iterator, chunk_size: int):
    #Turn an iterator into multiple small pandas.DataFrame
    #This is a balance between memory and efficiency
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames, sort=False) if frames else pd.DataFrame()

## Now get key data

In [47]:
import timeit
start_time = timeit.default_timer()

dfx = iterator2dataframe(filings_990.find({}, {'_id': 0, 'EIN': 1, #'OrganizationName': 1, 'DLN': 1, 'URL': 1,                
    'Form990PartIV': 1, 
    'FundraisingActivities': 1, 
    'FundraisingActivitiesInd': 1, 
    'Gaming': 1, 
    'GamingActivitiesInd': 1, 
    'ProfessionalFundraising': 1, 
    'ProfessionalFundraisingInd': 1,
    'Organization501c': 1,
    'Organization501cInd': 1,                                               
    'Organization501c3': 1,
    'Organization501c3Ind': 1})[:10000], 1000)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 
print("Number of columns:", len(dfx.columns))
print("Number of observations:", len(dfx))
dfx[:1]  

# of minutes:  0.09721179666667012
Number of columns: 6
Number of observations: 10000


Unnamed: 0,EIN,Organization501c3,ProfessionalFundraising,FundraisingActivities,Gaming,Organization501c
0,232705170,X,0,0,0,


In [49]:
%%time
cursor = filings_990.find({}, {'_id': 0, 'EIN': 1, #'OrganizationName': 1, 'DLN': 1, 'URL': 1,                
    'Form990PartIV': 1, 
    'FundraisingActivities': 1, 
    'FundraisingActivitiesInd': 1, 
    'Gaming': 1, 
    'GamingActivitiesInd': 1, 
    'ProfessionalFundraising': 1, 
    'ProfessionalFundraisingInd': 1,
    'Organization501c': 1,
    'Organization501cInd': 1,                                               
    'Organization501c3': 1,
    'Organization501c3Ind': 1})
dfx = pd.DataFrame()
for batch in batched(cursor, 10000):
    dfx = dfx.append(batch, ignore_index=True)
dfx[:1]

Wall time: 9min 22s


Unnamed: 0,EIN,Organization501c3,ProfessionalFundraising,FundraisingActivities,Gaming,Organization501c,Organization501c3Ind,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,Organization501cInd
0,232705170,X,0,0,0,,,,,,


In [74]:
print(len(dfx))

1895016


In [28]:
import timeit
start_time = timeit.default_timer()

df = iterator2dataframe(filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,                
    'TaxPeriod': 1,
    'ReturnHeader': 1}), 10000)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 
print("Number of columns:", len(df.columns))
print("Number of observations:", len(df))
df[:1]    

# of minutes:  114.27548303
Number of columns: 6
Number of observations: 1895016


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-11-09T06:41:09-06:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': ..."


# 2/14/2020 -- Flatten *ReturnHeader* column
# THIS IS A NEW APPROACH --> FLATTEN THEN REMOVE NON-USED COLUMNS
# BE SURE TO FOLLOW THROUGH WITH CHANGES TO THE FOLLOWING VARIABLES IN SUBSEQUENT NOTEBOOKS:
    'ReturnHeader.TaxYear': 1, 'ReturnHeader.TaxYr': 1,
    'ReturnHeader.TaxPeriodEndDate': 1, 'ReturnHeader.TaxPeriodEndDt': 1,  
    
# 2/19/2020 -- Also added in *Filer* in order to get state
- See *IRS 990 e-File Data -- CONTROL VARIABLES (A8) -- Prepare Dataset for Statistical Analysis (Select Cases - 501c3, valid data).ipynb*

In [29]:
import timeit
start_time = timeit.default_timer()
print("Number of columns:", len(df.columns))
df = pd.concat([df.drop(['ReturnHeader'], axis=1), df['ReturnHeader'].apply(pd.Series)], axis=1)
print('# of minutes: ', elapsed/60) 
print("Number of columns:", len(df.columns))
df[:2]

Number of columns: 6
# of minutes:  114.27548303
Number of columns: 28


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,@binaryAttachmentCount,Timestamp,TaxPeriodEndDate,PreparerFirm,ReturnType,TaxPeriodBeginDate,Filer,Officer,Preparer,TaxYear,BuildTS,DisasterRelief,@binaryAttachmentCnt,ReturnTs,TaxPeriodEndDt,PreparerFirmGrp,ReturnTypeCd,TaxPeriodBeginDt,BusinessOfficerGrp,PreparerPersonGrp,TaxYr,DisasterReliefTxt,FilingSecurityInformation
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,0,2011-11-09T06:41:09-06:00,2010-12-31,"{'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY SUITE 30', 'City': 'BETHLEHEM', 'State': 'PA', 'ZIPCode': '180172285'}}",990,2010-01-01,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...","{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}","{'Name': 'E BARRY HETZEL CPA', 'Phone': '6104335501'}",2010,2016-02-24 21:20:13Z,,,,,,,,,,,,
1,TORRINGTON VOA ELDERLY HOUSING INC BELL PARK TOWER,https://s3.amazonaws.com/irs-form-990/201113139349301311_public.xml,93493313013111,201106,581805618,0,2011-11-09T07:32:06-08:00,2011-06-30,"{'PreparerFirmBusinessName': {'BusinessNameLine1': 'MADDOX & ASSOCIATES APC'}, 'PreparerFirmUSAddress': {'AddressLine1': '5627 BANKERS AVE BLDG 2', 'City': 'BATON ROUGE', 'State': 'LA', 'ZIPCode': '708082610'}}",990,2010-07-01,"{'EIN': '581805618', 'Name': {'BusinessNameLine1': 'TORRINGTON VOA ELDERLY HOUSING INC', 'BusinessNameLine2': 'BELL PARK TOWER'}, 'NameControl': 'TORR', 'Phone': '7033415000', 'USAddress': {'AddressLine1': '1660 DUKE STREET', 'City': 'ALEXANDRIA'...","{'Name': 'THOMAS D TURNBULL', 'Title': 'ASST. SEC/TREAS', 'DateSigned': '2011-11-09'}","{'Name': 'WILLIAM B BEALE', 'Phone': '2259263360'}",2010,2016-02-24 21:20:13Z,,,,,,,,,,,,


<br>Save DF

In [30]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings nov. 2020 - select ReturnHeader variables.pkl.gz', compression='gzip')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  4.347203583333339


In [31]:
print([c for c in df.columns.tolist() if c not in mongo_cols])
omit_cols = ['@binaryAttachmentCount', 'PreparerFirm', 'ReturnType', 'TaxPeriodBeginDate', 'Preparer', 
             'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd', 'TaxPeriodBeginDt', 
             'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation']
print(len(df.columns))
df = df[[c for c in df.columns.tolist() if c not in omit_cols]]
print(len(df))
print(len(df.columns))
df[:1]

['OrganizationName', 'URL', 'DLN', 'EIN', '@binaryAttachmentCount', 'PreparerFirm', 'ReturnType', 'TaxPeriodBeginDate', 'Preparer', 'DisasterRelief', '@binaryAttachmentCnt', 'PreparerFirmGrp', 'ReturnTypeCd', 'TaxPeriodBeginDt', 'PreparerPersonGrp', 'DisasterReliefTxt', 'FilingSecurityInformation']
28
1895016
15


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,Timestamp,TaxPeriodEndDate,Filer,Officer,TaxYear,BuildTS,ReturnTs,TaxPeriodEndDt,BusinessOfficerGrp,TaxYr
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,2011-11-09T06:41:09-06:00,2010-12-31,"{'EIN': '232705170', 'Name': {'BusinessNameLine1': 'RONALD MCDONALD HOUSE CHARITIES-', 'BusinessNameLine2': 'PHILADELPHIA REGION INC'}, 'NameControl': 'RONA', 'Phone': '8565826843', 'USAddress': {'AddressLine1': '1525 VALLEY CENTER PARKWAY NO 300...","{'Name': 'ROBERT TRAA', 'Title': 'TREASURER', 'Phone': '8565826843', 'DateSigned': '2011-11-04', 'AuthorizeThirdParty': '1'}",2010,2016-02-24 21:20:13Z,,,,


<br>Save *ReturnHeader* DF

In [32]:
import timeit
start_time = timeit.default_timer()
df.to_pickle('all filings nov. 2020 - select ReturnHeader variables.pkl.gz', compression='gzip')
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 

# of minutes:  2.9783590866666296


# ENDED HERE 11/23/2020

# 2/14/2020
# Update remainder by including only *ReturnHeader* instead of these four:

    'ReturnHeader.TaxYear': 1, 'ReturnHeader.TaxYr': 1,
    'ReturnHeader.TaxPeriodEndDate': 1, 'ReturnHeader.TaxPeriodEndDt': 1,  

In [21]:
import timeit
start_time = timeit.default_timer()

df = iterator2dataframe(filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,                
    'TaxPeriod': 1,
    #'ReturnHeader.TaxYear': 1, 'ReturnHeader.TaxYr': 1,
    #'ReturnHeader.TaxPeriodEndDate': 1, 'ReturnHeader.TaxPeriodEndDt': 1,  
    'ReturnHeader': 1,                                              
    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidToMembersCY': 1, 
    'BenefitsPaidToMembersPriorYear': 1, 
    'CYBenefitsPaidToMembersAmt': 1, 
    'CYContributionsGrantsAmt': 1, 
    'CYGrantsAndSimilarPaidAmt': 1, 
    'CYInvestmentIncomeAmt': 1, 
    'CYOtherExpensesAmt': 1, 
    'CYOtherRevenueAmt': 1, 
    'CYProgramServiceRevenueAmt': 1, 
    'CYRevenuesLessExpensesAmt': 1, 
    'CYSalariesCompEmpBnftPaidAmt': 1, 
    'CYTotalExpensesAmt': 1, 
    'CYTotalFundraisingExpenseAmt': 1, 
    'CYTotalProfFndrsngExpnsAmt': 1, 
    'CYTotalRevenueAmt': 1, 
    'CashNonInterestBearing': 1, 
    'CashNonInterestBearingGrp': 1, 
    'ChangeToOrgDocumentsInd': 1, 
    'ChangesToOrganizingDocs': 1, 
    'CntrbtnsRprtdFundraisingEvents': 1, 
    'CntrctRcvdGreaterThan100KCnt': 1, 
    'CompCurrentOfcrDirectorsGrp': 1, 
    'CompCurrentOfficersDirectors': 1, 
    'CompDisqualPersons': 1, 
    'CompDisqualPersonsGrp': 1, 
    'CompensationFromOtherSources': 1, 
    'CompensationFromOtherSrcsInd': 1, 
    'CompensationProcessCEO': 1, 
    'CompensationProcessCEOInd': 1, 
    'CompensationProcessOther': 1, 
    'CompensationProcessOtherInd': 1, 
    'ConflictOfInterestPolicy': 1, 
    'ConflictOfInterestPolicyInd': 1, 
    'ContractTerminationInd': 1, 
    'ContriRptFundraisingEventAmt': 1, 
    'ContributionsGrantsCurrentYear': 1, 
    'ContributionsGrantsPriorYear': 1, 
    'CostOfGoodsSold': 1, 
    'CostOfGoodsSoldAmt': 1, 
    'CountryLegalDomicile': 1, 
    'DecisionsSubjectToApprovaInd': 1, 
    'DecisionsSubjectToApproval': 1, 
    'DelegationOfManagementDuties': 1, 
    'DelegationOfMgmtDutiesInd': 1, 
    'DoNotFollowSFAS117': 1, 
    'DocumentRetentionPolicy': 1, 
    'DocumentRetentionPolicyInd': 1, 
    'DonatedServicesAndUseFcltsAmt': 1, 
    'ElectionOfBoardMembers': 1, 
    'ElectionOfBoardMembersInd': 1, 
    'FSAudited': 1, 
    'FSAuditedInd': 1, 
    'FamilyOrBusinessRelationship': 1, 
    'FamilyOrBusinessRlnInd': 1, 
    'FederalGrantAuditPerformed': 1, 
    'FederalGrantAuditPerformedInd': 1, 
    'FederalGrantAuditRequired': 1, 
    'FederalGrantAuditRequiredInd': 1, 
    'FederatedCampaigns': 1, 
    'FederatedCampaignsAmt': 1, 
    'FeesForServicesAccounting': 1, 
    'FeesForServicesAccountingGrp': 1, 
    'FeesForServicesInvstMgmntFees': 1, 
    'FeesForServicesLegal': 1, 
    'FeesForServicesLegalGrp': 1, 
    'FeesForServicesLobbying': 1, 
    'FeesForServicesLobbyingGrp': 1, 
    'FeesForServicesManagement': 1, 
    'FeesForServicesManagementGrp': 1, 
    'FeesForServicesOther': 1, 
    'FeesForServicesOtherGrp': 1, 
    'FeesForServicesProfFundraising': 1, 
    'FeesForSrvcInvstMgmntFeesGrp': 1, 
    'FinalReturnInd': 1, 
    'FollowSFAS117': 1, 
    'Form990ProvidedToGoverningBody': 1, 
    'Form990ProvidedToGvrnBodyInd': 1, 
    'FormationYr': 1, 
    'FormerOfcrEmployeesListedInd': 1, 
    'FormersListed': 1, 
    'FundraisingAmt': 1, 
    'FundraisingDirectExpenses': 1, 
    'FundraisingDirectExpensesAmt': 1, 
    'FundraisingEvents': 1, 
    'FundraisingGrossIncomeAmt': 1, 
    'GamingDirectExpenses': 1, 
    'GamingDirectExpensesAmt': 1, 
    'GamingGrossIncomeAmt': 1, 
    'GoverningBodyVotingMembersCnt': 1, 
    'GovernmentGrants': 1, 
    'GovernmentGrantsAmt': 1, 
    'GrantsAndSimilarAmntsCY': 1, 
    'GrantsAndSimilarAmntsPriorYear': 1, 
    'GrossIncomeFundraisingEvents': 1, 
    'GrossIncomeGaming': 1, 
    'GrossReceipts': 1, 
    'GrossReceiptsAmt': 1, 
    'GrossSalesOfInventory': 1, 
    'GrossSalesOfInventoryAmt': 1, 
    'GroupExemptionNum': 1, 
    'GroupExemptionNumber': 1, 
    'GroupReturnForAffiliates': 1, 
    'GroupReturnForAffiliatesInd': 1, 
    'IndependentVotingMemberCnt': 1, 
    'IndivRcvdGreaterThan100KCnt': 1, 
    'InitialReturn': 1, 
    'InitialReturnInd': 1, 
    'InvestmentExpenseAmt': 1, 
    'InvestmentInJointVenture': 1, 
    'InvestmentInJointVentureInd': 1, 
    'InvestmentIncomeCurrentYear': 1, 
    'InvestmentIncomePriorYear': 1, 
    'LandBldgEquipAccumDeprecAmt': 1, 
    'LandBldgEquipCostOrOtherBssAmt': 1, 
    'LandBldgEquipmentAccumDeprec': 1, 
    'LandBuildingsEquipmentBasis': 1, 
    'LegalDomicileCountryCd': 1, 
    'LegalDomicileStateCd': 1, 
    'LoansFromOfficersDirectors': 1, 
    'LoansFromOfficersDirectorsGrp': 1, 
    'LocalChapters': 1, 
    'LocalChaptersInd': 1, 
    'MaterialDiversionOrMisuse': 1, 
    'MaterialDiversionOrMisuseInd': 1, 
    'MembersOrStockholders': 1, 
    'MembersOrStockholdersInd': 1, 
    'MembershipDues': 1, 
    'MembershipDuesAmt': 1, 
    'MethodOfAccountingAccrual': 1, 
    'MethodOfAccountingAccrualInd': 1, 
    'MethodOfAccountingCash': 1, 
    'MethodOfAccountingCashInd': 1, 
    'MethodOfAccountingOther': 1, 
    'MethodOfAccountingOtherInd': 1, 
    'MinutesOfCommittees': 1, 
    'MinutesOfCommitteesInd': 1, 
    'MinutesOfGoverningBody': 1, 
    'MinutesOfGoverningBodyInd': 1, 
    'MortNotesPyblSecuredInvestProp': 1, 
    'MortgNotesPyblScrdInvstPropGrp': 1, 
    'NameOfPrincipalOfficerPerson': 1, 
    'NbrIndependentVotingMembers': 1, 
    'NbrVotingGoverningBodyMembers': 1, 
    'NbrVotingMembersGoverningBody': 1, 
    'NetAssetsOrFundBalancesBOY': 1, 
    'NetAssetsOrFundBalancesBOYAmt': 1, 
    'NetAssetsOrFundBalancesEOY': 1, 
    'NetAssetsOrFundBalancesEOYAmt': 1, 
    'NetUnrelatedBusTxblIncmAmt': 1, 
    'NetUnrelatedBusinessTxblIncome': 1, 
    'NetUnrlzdGainsLossesInvstAmt': 1, 
    'NoListedPersonsCompensated': 1, 
    'NoListedPersonsCompensatedInd': 1, 
    'NoncashContributions': 1, 
    'NoncashContributionsAmt': 1, 
    'NumberIndependentVotingMembers': 1, 
    'NumberIndividualsGT100K': 1, 
    'NumberOfContractorsGT100K': 1, 
    'OfficerMailingAddress': 1, 
    'OfficerMailingAddressInd': 1, 
    'OrgDoesNotFollowSFAS117Ind': 1, 
    'Organization4947a1': 1, 
    'Organization4947a1NotPFInd': 1, 
    'Organization501c': 1, 
    'Organization501c3': 1, 
    'Organization501c3Ind': 1, 
    'Organization501cInd': 1, 
    'OrganizationFollowsSFAS117Ind': 1, 
    'OtherEmployeeBenefits': 1, 
    'OtherEmployeeBenefitsGrp': 1, 
    'OtherExpensePriorYear': 1, 
    'OtherExpensesCurrentYear': 1, 
    'OtherLiabilities': 1, 
    'OtherLiabilitiesGrp': 1, 
    'OtherOrganizationDsc': 1, 
    'OtherRevenueCurrentYear': 1, 
    'OtherRevenuePriorYear': 1, 
    'OtherRevenueTotalAmt': 1, 
    'OtherSalariesAndWages': 1, 
    'OtherSalariesAndWagesGrp': 1, 
    'OtherWebsite': 1, 
    'OtherWebsiteInd': 1, 
    'OwnWebsite': 1, 
    'OwnWebsiteInd': 1, 
    'PYBenefitsPaidToMembersAmt': 1, 
    'PYContributionsGrantsAmt': 1, 
    'PYGrantsAndSimilarPaidAmt': 1, 
    'PYInvestmentIncomeAmt': 1, 
    'PYOtherExpensesAmt': 1, 
    'PYOtherRevenueAmt': 1, 
    'PYProgramServiceRevenueAmt': 1, 
    'PYRevenuesLessExpensesAmt': 1, 
    'PYSalariesCompEmpBnftPaidAmt': 1, 
    'PYTotalExpensesAmt': 1, 
    'PYTotalProfFndrsngExpnsAmt': 1, 
    'PYTotalRevenueAmt': 1, 
    'PayrollTaxes': 1, 
    'PayrollTaxesGrp': 1, 
    'PensionPlanContributions': 1, 
    'PensionPlanContributionsGrp': 1, 
    'PoliciesReferenceChapters': 1, 
    'PoliciesReferenceChaptersInd': 1, 
    'PrincipalOfficerNm': 1, 
    'PriorPeriodAdjustmentsAmt': 1, 
    'ProgramServiceRevenueCY': 1, 
    'ProgramServiceRevenuePriorYear': 1, 
    'ReconcilationDonatedServices': 1, 
    'ReconcilationInvestExpenses': 1, 
    'ReconcilationPriorAdjustment': 1, 
    'ReconcilationRevenueExpenses': 1, 
    'ReconcilationRevenueExpnssAmt': 1, 
    'ReconciliationUnrealizedInvest': 1, 
    'RegularMonitoringEnforcement': 1, 
    'RegularMonitoringEnfrcInd': 1, 
    'RelatedOrganizations': 1, 
    'RelatedOrganizationsAmt': 1, 
    'RetainedEarningsEndowmentEtc': 1, 
    #'ReturnHeader.TaxPeriodEndDate': 1, 
    #'ReturnHeader.TaxPeriodEndDt': 1, 
    #'ReturnHeader.TaxYear': 1, 
    #'ReturnHeader.TaxYr': 1, 
    'RevenuesLessExpensesCY': 1, 
    'RevenuesLessExpensesPriorYear': 1, 
    'RtnEarnEndowmentIncmOthFndsGrp': 1, 
    'SalariesEtcCurrentYear': 1, 
    'SalariesEtcPriorYear': 1, 
    'SavingsAndTempCashInvestments': 1, 
    'SavingsAndTempCashInvstGrp': 1, 
    'SpecialConditionDesc': 1, 
    'SpecialConditionDescription': 1, 
    'StateLegalDomicile': 1, 
    'StatesWhereCopyOfReturnIsFiled': 1, 
    'StatesWhereCopyOfReturnIsFldCd': 1, 
    'TaxExemptBondLiabilities': 1, 
    'TaxExemptBondLiabilitiesGrp': 1, 
    'TaxPeriod': 1, 
    'TerminatedReturn': 1, 
    'TerminationOrContraction': 1, 
    'TotReportableCompRltdOrgAmt': 1, 
    'TotalAssets': 1, 
    'TotalAssetsBOY': 1, 
    'TotalAssetsBOYAmt': 1, 
    'TotalAssetsEOY': 1, 
    'TotalAssetsEOYAmt': 1, 
    'TotalAssetsGrp': 1, 
    'TotalCompGT150K': 1, 
    'TotalCompGreaterThan150KInd': 1, 
    'TotalContributions': 1, 
    'TotalContributionsAmt': 1, 
    'TotalEmployeeCnt': 1, 
    'TotalExpensesCurrentYear': 1, 
    'TotalExpensesPriorYear': 1, 
    'TotalFunctionalExpenses': 1, 
    'TotalFunctionalExpensesGrp': 1, 
    'TotalFundrsngExpCurrentYear': 1, 
    'TotalGrossUBI': 1, 
    'TotalGrossUBIAmt': 1, 
    'TotalLiabilitiesBOY': 1, 
    'TotalLiabilitiesBOYAmt': 1, 
    'TotalLiabilitiesEOY': 1, 
    'TotalLiabilitiesEOYAmt': 1, 
    'TotalNbrEmployees': 1, 
    'TotalNbrVolunteers': 1, 
    'TotalOtherCompensation': 1, 
    'TotalOtherCompensationAmt': 1, 
    'TotalOtherRevenue': 1, 
    'TotalProfFundrsngExpCY': 1, 
    'TotalProfFundrsngExpPriorYear': 1, 
    'TotalProgramServiceRevenue': 1, 
    'TotalProgramServiceRevenueAmt': 1, 
    'TotalReportableCompFrmRltdOrgs': 1, 
    'TotalReportableCompFromOrg': 1, 
    'TotalReportableCompFromOrgAmt': 1, 
    'TotalRevenue': 1, 
    'TotalRevenueCurrentYear': 1, 
    'TotalRevenueGrp': 1, 
    'TotalRevenuePriorYear': 1, 
    'TotalVolunteersCnt': 1, 
    'TypeOfOrgOtherDescription': 1, 
    'TypeOfOrganizationAssocInd': 1, 
    'TypeOfOrganizationAssociation': 1, 
    'TypeOfOrganizationCorpInd': 1, 
    'TypeOfOrganizationCorporation': 1, 
    'TypeOfOrganizationOther': 1, 
    'TypeOfOrganizationOtherInd': 1, 
    'TypeOfOrganizationTrust': 1, 
    'TypeOfOrganizationTrustInd': 1, 
    'UnsecuredNotesLoansPayable': 1, 
    'UnsecuredNotesLoansPayableGrp': 1, 
    'UponRequest': 1, 
    'UponRequestInd': 1, 
    'VotingMembersGoverningBodyCnt': 1, 
    'VotingMembersIndependentCnt': 1, 
    'WebSite': 1, 
    'WebsiteAddressTxt': 1, 
    'WhistleblowerPolicy': 1, 
    'WhistleblowerPolicyInd': 1, 
    'WrittenPolicyOrProcedure': 1, 
    'WrittenPolicyOrProcedureInd': 1, 
    'YearFormation': 1}), 10000)
elapsed = timeit.default_timer() - start_time
print('# of minutes: ', elapsed/60) 
print("Number of columns:", len(df.columns))
print("Number of observations:", len(df))
df[:1]    

KeyboardInterrupt: 

<br>Feb. 2020 run was 1.7 million observations and previous run was 1,617,920 observations. <br><br>

In [22]:
def batched(cursor, batch_size):
    batch = []
    for doc in cursor:
        batch.append(doc)
        if batch and not len(batch) % batch_size:
            yield batch
            batch = []
    if batch:   # last documents
        yield batch

In [23]:
cursor = filings_990.find({}, {'_id': 0, 'EIN': 1, 'OrganizationName': 1, 'DLN': 1, 'URL': 1,  'ReturnHeader': 1,
    'AccountantCompileOrReview': 1, 
    'AccountantCompileOrReviewInd': 1, 
    'ActivityOrMissionDesc': 1, 
    'ActivityOrMissionDescription': 1, 
    'AddressChange': 1, 
    'AddressChangeInd': 1, 
    'AllAffiliatesIncluded': 1, 
    'AllAffiliatesIncludedInd': 1, 
    'AllOtherContributions': 1, 
    'AllOtherContributionsAmt': 1, 
    'AmendedReturn': 1, 
    'AmendedReturnInd': 1, 
    'AnnualDisclosureCoveredPersons': 1, 
    'AnnualDisclosureCoveredPrsnInd': 1, 
    'AuditCommittee': 1, 
    'AuditCommitteeInd': 1, 
    'BenefitsPaidToMembersCY': 1, 
    'BenefitsPaidToMembersPriorYear': 1, 
    'BuildTS': 1, 
    'BusinessOfficerGrp': 1, 
    'CYBenefitsPaidToMembersAmt': 1, 
    'CYContributionsGrantsAmt': 1, 
    'CYGrantsAndSimilarPaidAmt': 1, 
    'CYInvestmentIncomeAmt': 1, 
    'CYOtherExpensesAmt': 1, 
    'CYOtherRevenueAmt': 1, 
    'CYProgramServiceRevenueAmt': 1, 
    'CYRevenuesLessExpensesAmt': 1, 
    'CYSalariesCompEmpBnftPaidAmt': 1, 
    'CYTotalExpensesAmt': 1, 
    'CYTotalFundraisingExpenseAmt': 1, 
    'CYTotalProfFndrsngExpnsAmt': 1, 
    'CYTotalRevenueAmt': 1, 
    'CashNonInterestBearing': 1, 
    'CashNonInterestBearingGrp': 1, 
    'ChangeToOrgDocumentsInd': 1, 
    'ChangesToOrganizingDocs': 1, 
    'CntrbtnsRprtdFundraisingEvents': 1, 
    'CntrctRcvdGreaterThan100KCnt': 1, 
    'CompCurrentOfcrDirectorsGrp': 1, 
    'CompCurrentOfficersDirectors': 1, 
    'CompDisqualPersons': 1, 
    'CompDisqualPersonsGrp': 1, 
    'CompensationFromOtherSources': 1, 
    'CompensationFromOtherSrcsInd': 1, 
    'CompensationProcessCEO': 1, 
    'CompensationProcessCEOInd': 1, 
    'CompensationProcessOther': 1, 
    'CompensationProcessOtherInd': 1, 
    'ConflictOfInterestPolicy': 1, 
    'ConflictOfInterestPolicyInd': 1, 
    'ContractTerminationInd': 1, 
    'ContriRptFundraisingEventAmt': 1, 
    'ContributionsGrantsCurrentYear': 1, 
    'ContributionsGrantsPriorYear': 1, 
    'CostOfGoodsSold': 1, 
    'CostOfGoodsSoldAmt': 1, 
    'CountryLegalDomicile': 1, 
    'DecisionsSubjectToApprovaInd': 1, 
    'DecisionsSubjectToApproval': 1, 
    'DelegationOfManagementDuties': 1, 
    'DelegationOfMgmtDutiesInd': 1, 
    'DoNotFollowSFAS117': 1, 
    'DocumentRetentionPolicy': 1, 
    'DocumentRetentionPolicyInd': 1, 
    'DonatedServicesAndUseFcltsAmt': 1, 
    'ElectionOfBoardMembers': 1, 
    'ElectionOfBoardMembersInd': 1, 
    'FSAudited': 1, 
    'FSAuditedInd': 1, 
    'FamilyOrBusinessRelationship': 1, 
    'FamilyOrBusinessRlnInd': 1, 
    'FederalGrantAuditPerformed': 1, 
    'FederalGrantAuditPerformedInd': 1, 
    'FederalGrantAuditRequired': 1, 
    'FederalGrantAuditRequiredInd': 1, 
    'FederatedCampaigns': 1, 
    'FederatedCampaignsAmt': 1, 
    'FeesForServicesAccounting': 1, 
    'FeesForServicesAccountingGrp': 1, 
    'FeesForServicesInvstMgmntFees': 1, 
    'FeesForServicesLegal': 1, 
    'FeesForServicesLegalGrp': 1, 
    'FeesForServicesLobbying': 1, 
    'FeesForServicesLobbyingGrp': 1, 
    'FeesForServicesManagement': 1, 
    'FeesForServicesManagementGrp': 1, 
    'FeesForServicesOther': 1, 
    'FeesForServicesOtherGrp': 1, 
    'FeesForServicesProfFundraising': 1, 
    'FeesForSrvcInvstMgmntFeesGrp': 1, 
    'Filer': 1, 
    'FinalReturnInd': 1, 
    'FollowSFAS117': 1, 
    'Form990ProvidedToGoverningBody': 1, 
    'Form990ProvidedToGvrnBodyInd': 1, 
    'FormationYr': 1, 
    'FormerOfcrEmployeesListedInd': 1, 
    'FormersListed': 1, 
    'FundraisingActivities': 1, 
    'FundraisingActivitiesInd': 1, 
    'FundraisingAmt': 1, 
    'FundraisingDirectExpenses': 1, 
    'FundraisingDirectExpensesAmt': 1, 
    'FundraisingEvents': 1, 
    'FundraisingGrossIncomeAmt': 1, 
    'Gaming': 1, 
    'GamingActivitiesInd': 1, 
    'GamingDirectExpenses': 1, 
    'GamingDirectExpensesAmt': 1, 
    'GamingGrossIncomeAmt': 1, 
    'GoverningBodyVotingMembersCnt': 1, 
    'GovernmentGrants': 1, 
    'GovernmentGrantsAmt': 1, 
    'GrantsAndSimilarAmntsCY': 1, 
    'GrantsAndSimilarAmntsPriorYear': 1, 
    'GrossIncomeFundraisingEvents': 1, 
    'GrossIncomeGaming': 1, 
    'GrossReceipts': 1, 
    'GrossReceiptsAmt': 1, 
    'GrossSalesOfInventory': 1, 
    'GrossSalesOfInventoryAmt': 1, 
    'GroupExemptionNum': 1, 
    'GroupExemptionNumber': 1, 
    'GroupReturnForAffiliates': 1, 
    'GroupReturnForAffiliatesInd': 1, 
    'IndependentVotingMemberCnt': 1, 
    'IndivRcvdGreaterThan100KCnt': 1, 
    'InitialReturn': 1, 
    'InitialReturnInd': 1, 
    'InvestmentExpenseAmt': 1, 
    'InvestmentInJointVenture': 1, 
    'InvestmentInJointVentureInd': 1, 
    'InvestmentIncomeCurrentYear': 1, 
    'InvestmentIncomePriorYear': 1, 
    'LandBldgEquipAccumDeprecAmt': 1, 
    'LandBldgEquipCostOrOtherBssAmt': 1, 
    'LandBldgEquipmentAccumDeprec': 1, 
    'LandBuildingsEquipmentBasis': 1, 
    'LegalDomicileCountryCd': 1, 
    'LegalDomicileStateCd': 1, 
    'LoansFromOfficersDirectors': 1, 
    'LoansFromOfficersDirectorsGrp': 1, 
    'LocalChapters': 1, 
    'LocalChaptersInd': 1, 
    'MaterialDiversionOrMisuse': 1, 
    'MaterialDiversionOrMisuseInd': 1, 
    'MembersOrStockholders': 1, 
    'MembersOrStockholdersInd': 1, 
    'MembershipDues': 1, 
    'MembershipDuesAmt': 1, 
    'MethodOfAccountingAccrual': 1, 
    'MethodOfAccountingAccrualInd': 1, 
    'MethodOfAccountingCash': 1, 
    'MethodOfAccountingCashInd': 1, 
    'MethodOfAccountingOther': 1, 
    'MethodOfAccountingOtherInd': 1, 
    'MinutesOfCommittees': 1, 
    'MinutesOfCommitteesInd': 1, 
    'MinutesOfGoverningBody': 1, 
    'MinutesOfGoverningBodyInd': 1, 
    'MortNotesPyblSecuredInvestProp': 1, 
    'MortgNotesPyblScrdInvstPropGrp': 1, 
    'NameOfPrincipalOfficerPerson': 1, 
    'NbrIndependentVotingMembers': 1, 
    'NbrVotingGoverningBodyMembers': 1, 
    'NbrVotingMembersGoverningBody': 1, 
    'NetAssetsOrFundBalancesBOY': 1, 
    'NetAssetsOrFundBalancesBOYAmt': 1, 
    'NetAssetsOrFundBalancesEOY': 1, 
    'NetAssetsOrFundBalancesEOYAmt': 1, 
    'NetUnrelatedBusTxblIncmAmt': 1, 
    'NetUnrelatedBusinessTxblIncome': 1, 
    'NetUnrlzdGainsLossesInvstAmt': 1, 
    'NoListedPersonsCompensated': 1, 
    'NoListedPersonsCompensatedInd': 1, 
    'NoncashContributions': 1, 
    'NoncashContributionsAmt': 1, 
    'NumberIndependentVotingMembers': 1, 
    'NumberIndividualsGT100K': 1, 
    'NumberOfContractorsGT100K': 1, 
    'Officer': 1, 
    'OfficerMailingAddress': 1, 
    'OfficerMailingAddressInd': 1, 
    'OrgDoesNotFollowSFAS117Ind': 1, 
    'Organization4947a1': 1, 
    'Organization4947a1NotPFInd': 1, 
    'Organization501c': 1, 
    'Organization501c3': 1, 
    'Organization501c3Ind': 1, 
    'Organization501cInd': 1, 
    'OrganizationFollowsSFAS117Ind': 1, 
    'OtherEmployeeBenefits': 1, 
    'OtherEmployeeBenefitsGrp': 1, 
    'OtherExpensePriorYear': 1, 
    'OtherExpensesCurrentYear': 1, 
    'OtherLiabilities': 1, 
    'OtherLiabilitiesGrp': 1, 
    'OtherOrganizationDsc': 1, 
    'OtherRevenueCurrentYear': 1, 
    'OtherRevenuePriorYear': 1, 
    'OtherRevenueTotalAmt': 1, 
    'OtherSalariesAndWages': 1, 
    'OtherSalariesAndWagesGrp': 1, 
    'OtherWebsite': 1, 
    'OtherWebsiteInd': 1, 
    'OwnWebsite': 1, 
    'OwnWebsiteInd': 1, 
    'PYBenefitsPaidToMembersAmt': 1, 
    'PYContributionsGrantsAmt': 1, 
    'PYGrantsAndSimilarPaidAmt': 1, 
    'PYInvestmentIncomeAmt': 1, 
    'PYOtherExpensesAmt': 1, 
    'PYOtherRevenueAmt': 1, 
    'PYProgramServiceRevenueAmt': 1, 
    'PYRevenuesLessExpensesAmt': 1, 
    'PYSalariesCompEmpBnftPaidAmt': 1, 
    'PYTotalExpensesAmt': 1, 
    'PYTotalProfFndrsngExpnsAmt': 1, 
    'PYTotalRevenueAmt': 1, 
    'PayrollTaxes': 1, 
    'PayrollTaxesGrp': 1, 
    'PensionPlanContributions': 1, 
    'PensionPlanContributionsGrp': 1, 
    'PoliciesReferenceChapters': 1, 
    'PoliciesReferenceChaptersInd': 1, 
    'PrincipalOfficerNm': 1, 
    'PriorPeriodAdjustmentsAmt': 1, 
    'ProfessionalFundraising': 1, 
    'ProfessionalFundraisingInd': 1, 
    'ProgramServiceRevenueCY': 1, 
    'ProgramServiceRevenuePriorYear': 1, 
    'ReconcilationDonatedServices': 1, 
    'ReconcilationInvestExpenses': 1, 
    'ReconcilationPriorAdjustment': 1, 
    'ReconcilationRevenueExpenses': 1, 
    'ReconcilationRevenueExpnssAmt': 1, 
    'ReconciliationUnrealizedInvest': 1, 
    'RegularMonitoringEnforcement': 1, 
    'RegularMonitoringEnfrcInd': 1, 
    'RelatedOrganizations': 1, 
    'RelatedOrganizationsAmt': 1, 
    'RetainedEarningsEndowmentEtc': 1, 
    'ReturnTs': 1, 
    'RevenuesLessExpensesCY': 1, 
    'RevenuesLessExpensesPriorYear': 1, 
    'RtnEarnEndowmentIncmOthFndsGrp': 1, 
    'SalariesEtcCurrentYear': 1, 
    'SalariesEtcPriorYear': 1, 
    'SavingsAndTempCashInvestments': 1, 
    'SavingsAndTempCashInvstGrp': 1, 
    'SpecialConditionDesc': 1, 
    'SpecialConditionDescription': 1, 
    'StateLegalDomicile': 1, 
    'StatesWhereCopyOfReturnIsFiled': 1, 
    'StatesWhereCopyOfReturnIsFldCd': 1, 
    'TaxExemptBondLiabilities': 1, 
    'TaxExemptBondLiabilitiesGrp': 1, 
    'TaxPeriod': 1, 
    'TaxPeriodEndDate': 1, 
    'TaxPeriodEndDt': 1, 
    'TaxYear': 1, 
    'TaxYr': 1, 
    'TerminatedReturn': 1, 
    'TerminationOrContraction': 1, 
    'Timestamp': 1, 
    'TotReportableCompRltdOrgAmt': 1, 
    'TotalAssets': 1, 
    'TotalAssetsBOY': 1, 
    'TotalAssetsBOYAmt': 1, 
    'TotalAssetsEOY': 1, 
    'TotalAssetsEOYAmt': 1, 
    'TotalAssetsGrp': 1, 
    'TotalCompGT150K': 1, 
    'TotalCompGreaterThan150KInd': 1, 
    'TotalContributions': 1, 
    'TotalContributionsAmt': 1, 
    'TotalEmployeeCnt': 1, 
    'TotalExpensesCurrentYear': 1, 
    'TotalExpensesPriorYear': 1, 
    'TotalFunctionalExpenses': 1, 
    'TotalFunctionalExpensesGrp': 1, 
    'TotalFundrsngExpCurrentYear': 1, 
    'TotalGrossUBI': 1, 
    'TotalGrossUBIAmt': 1, 
    'TotalLiabilitiesBOY': 1, 
    'TotalLiabilitiesBOYAmt': 1, 
    'TotalLiabilitiesEOY': 1, 
    'TotalLiabilitiesEOYAmt': 1, 
    'TotalNbrEmployees': 1, 
    'TotalNbrVolunteers': 1, 
    'TotalOtherCompensation': 1, 
    'TotalOtherCompensationAmt': 1, 
    'TotalOtherRevenue': 1, 
    'TotalProfFundrsngExpCY': 1, 
    'TotalProfFundrsngExpPriorYear': 1, 
    'TotalProgramServiceRevenue': 1, 
    'TotalProgramServiceRevenueAmt': 1, 
    'TotalReportableCompFrmRltdOrgs': 1, 
    'TotalReportableCompFromOrg': 1, 
    'TotalReportableCompFromOrgAmt': 1, 
    'TotalRevenue': 1, 
    'TotalRevenueCurrentYear': 1, 
    'TotalRevenueGrp': 1, 
    'TotalRevenuePriorYear': 1, 
    'TotalVolunteersCnt': 1, 
    'TypeOfOrgOtherDescription': 1, 
    'TypeOfOrganizationAssocInd': 1, 
    'TypeOfOrganizationAssociation': 1, 
    'TypeOfOrganizationCorpInd': 1, 
    'TypeOfOrganizationCorporation': 1, 
    'TypeOfOrganizationOther': 1, 
    'TypeOfOrganizationOtherInd': 1, 
    'TypeOfOrganizationTrust': 1, 
    'TypeOfOrganizationTrustInd': 1, 
    'UnsecuredNotesLoansPayable': 1, 
    'UnsecuredNotesLoansPayableGrp': 1, 
    'UponRequest': 1, 
    'UponRequestInd': 1, 
    'VotingMembersGoverningBodyCnt': 1, 
    'VotingMembersIndependentCnt': 1, 
    'WebSite': 1, 
    'WebsiteAddressTxt': 1, 
    'WhistleblowerPolicy': 1, 
    'WhistleblowerPolicyInd': 1, 
    'WrittenPolicyOrProcedure': 1, 
    'WrittenPolicyOrProcedureInd': 1, 
    'YearFormation': 1})

# THIS WORKED!

In [24]:
%%time
df = pd.DataFrame()
for batch in batched(cursor, 100000):
    df = df.append(batch, ignore_index=True)
df[:1]

Wall time: 1h 33min 37s


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-11-09T06:41:09-06:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': ...",X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0,0,0,1044925,1439340,0,0,30447,33563,0,1000,1075372,1473903,638637,925000,0,0,0,0,0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,0,0,0,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0,0,0,0,0,0,0,0,1439340,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}",256845,86228,"{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '51640', 'EOY': '240077'}",X,89152,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Save DF

In [78]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


In [79]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables.pkl.gz', compression='gzip')

Wall time: 26min 48s


# Note 1/1/2021
One row has several missing variables fixed in two subsequent notebooks. See the next few code blocks. Fix in next iteration.

Not sure why this happened. The index file has the correct information, and this should have just been inserted. 
- {"EIN":"204814407","TaxPeriod":"201812","DLN":"93493319065509","FormType":"990","URL":"https://s3.amazonaws.com/irs-form-990/201903199349306550_public.xml","OrganizationName":"PLAY FLAG FOOTBALL","SubmittedOn":"2020-01-17","ObjectId":"201903199349306550","LastUpdated":"2020-07-28T16:00:13"}

In [29]:
df.loc[[1895015]]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc
1895015,,,,,,"{'@binaryAttachmentCnt': '0', 'ReturnTs': '2019-11-15T10:29:26-06:00', 'TaxPeriodEndDt': '2018-12-31', 'PreparerFirmGrp': {'PreparerFirmEIN': '770051130', 'PreparerFirmName': {'BusinessNameLine1Txt': 'ABBOTT STRINGHAM & LYNCH'}, 'PreparerUSAddress': {'AddressLine1Txt': '1530 MERIDIAN AVE 2ND FLR', 'CityNm': 'SAN JOSE', 'StateAbbreviationCd': 'CA', 'ZIPCd': '95125'}}, 'ReturnTypeCd': '990', 'TaxPeriodBeginDt': '2018-01-01', 'Filer': {'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationCd': 'CA', 'ZIPCd': '95008'}}, 'BusinessOfficerGrp': {'PersonNm': 'JOHN MORA', 'PersonTitleTxt': 'PRESIDENT', 'PhoneNum': '4083700500', 'SignatureDt': '2019-11-14', 'DiscussWithPaidPreparerInd': '1'}, 'PreparerPersonGrp': {'PreparerPersonNm': 'FRANK L BOITANO', 'PTIN': 'P00058069', 'PhoneNum': '4083778700', 'PreparationDt': '2019-11-14'}, 'FilingSecurityInformation': {'IPAddress': {'IPv4AddressTxt': '12.217.167.71'}, 'IPDt': '2019-11-14', 'IPTm': '12:01:59', 'IPTimezoneCd': 'PS', 'FilingLicenseTypeCd': 'P', 'AtSubmissionCreationDeviceId': 'F2CBB06AF0C0BFA6D697359176620E55FDB7D752', 'AtSubmissionFilingDeviceId': 'A3800E77008D47A4C38094878DB0F53CA65DCD5F'}, 'TaxYr': '2018', 'BuildTS': '2020-04-17 16:48:07Z'}",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,JOHN MORA,1075,0,X,X,2006,CA,OPERATING YOUTH SPORTS PROGRAMS FOR THE PUBLIC BENEFIT,3,0,0,0,0,1075,0,0,1075,0,0,0,0,0,10378,10378,-9303,32567,23264,0,32567,23264,0,0,0,3,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,0,0,CA,X,0,0,0,,,,,,"{'TotalRevenueColumnAmt': '1075', 'RelatedOrExemptFuncIncomeAmt': '1075', 'UnrelatedBusinessRevenueAmt': '0', 'ExclusionAmt': '0'}",,"{'TotalAmt': '10378', 'ProgramServicesAmt': '8649', 'ManagementAndGeneralAmt': '1729', 'FundraisingAmt': '0'}","{'BOYAmt': '6630', 'EOYAmt': '4245'}","{'BOYAmt': '32567', 'EOYAmt': '23264'}",,,-9303,X,0,0,0,WWW.PLAYFLAGFOOTBALL.COM,300,0,0,250,0,0,250,0,0,0,0,8092,8092,-7842,0,1,1,X,0,0,0,0,0,,1075,,,,,,,,,,,,,215130,196111,,,X,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [32]:
# *IRS 990 e-File Data -- CONTROL VARIABLES (A4-1) -- Change Data Types.ipynb*
df.loc[1895015, 'EIN'] = '204814407'

# *IRS 990 e-File Data -- CONTROL VARIABLES (A4-2) -- Fill in Missing Values.ipynb*
df.loc[1895015, 'OrganizationName'] = 'PLAY FLAG FOOTBALL'

# This notebook: *IRS 990 e-File Data -- CONTROL VARIABLES (A1) -- Extract All Variables (py36).ipynb*
df.loc[1895015, 'URL'] = 'https://s3.amazonaws.com/irs-form-990/201903199349306550_public.xml'
df.loc[1895015, 'DLN'] = '93493319065509'
df.loc[1895015, 'TaxPeriod'] = '201812'

df.loc[[1895015]]

Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc
1895015,PLAY FLAG FOOTBALL,https://s3.amazonaws.com/irs-form-990/201903199349306550_public.xml,93493319065509,201812,204814407,"{'@binaryAttachmentCnt': '0', 'ReturnTs': '2019-11-15T10:29:26-06:00', 'TaxPeriodEndDt': '2018-12-31', 'PreparerFirmGrp': {'PreparerFirmEIN': '770051130', 'PreparerFirmName': {'BusinessNameLine1Txt': 'ABBOTT STRINGHAM & LYNCH'}, 'PreparerUSAddress': {'AddressLine1Txt': '1530 MERIDIAN AVE 2ND FLR', 'CityNm': 'SAN JOSE', 'StateAbbreviationCd': 'CA', 'ZIPCd': '95125'}}, 'ReturnTypeCd': '990', 'TaxPeriodBeginDt': '2018-01-01', 'Filer': {'EIN': '204814407', 'BusinessName': {'BusinessNameLine1Txt': 'PLAY FLAG FOOTBALL'}, 'BusinessNameControlTxt': 'PLAY', 'PhoneNum': '4083700500', 'USAddress': {'AddressLine1Txt': '545 WESTCHESTER DR NO A', 'CityNm': 'CAMPBELL', 'StateAbbreviationCd': 'CA', 'ZIPCd': '95008'}}, 'BusinessOfficerGrp': {'PersonNm': 'JOHN MORA', 'PersonTitleTxt': 'PRESIDENT', 'PhoneNum': '4083700500', 'SignatureDt': '2019-11-14', 'DiscussWithPaidPreparerInd': '1'}, 'PreparerPersonGrp': {'PreparerPersonNm': 'FRANK L BOITANO', 'PTIN': 'P00058069', 'PhoneNum': '4083778700', 'PreparationDt': '2019-11-14'}, 'FilingSecurityInformation': {'IPAddress': {'IPv4AddressTxt': '12.217.167.71'}, 'IPDt': '2019-11-14', 'IPTm': '12:01:59', 'IPTimezoneCd': 'PS', 'FilingLicenseTypeCd': 'P', 'AtSubmissionCreationDeviceId': 'F2CBB06AF0C0BFA6D697359176620E55FDB7D752', 'AtSubmissionFilingDeviceId': 'A3800E77008D47A4C38094878DB0F53CA65DCD5F'}, 'TaxYr': '2018', 'BuildTS': '2020-04-17 16:48:07Z'}",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,JOHN MORA,1075,0,X,X,2006,CA,OPERATING YOUTH SPORTS PROGRAMS FOR THE PUBLIC BENEFIT,3,0,0,0,0,1075,0,0,1075,0,0,0,0,0,10378,10378,-9303,32567,23264,0,32567,23264,0,0,0,3,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,0,0,CA,X,0,0,0,,,,,,"{'TotalRevenueColumnAmt': '1075', 'RelatedOrExemptFuncIncomeAmt': '1075', 'UnrelatedBusinessRevenueAmt': '0', 'ExclusionAmt': '0'}",,"{'TotalAmt': '10378', 'ProgramServicesAmt': '8649', 'ManagementAndGeneralAmt': '1729', 'FundraisingAmt': '0'}","{'BOYAmt': '6630', 'EOYAmt': '4245'}","{'BOYAmt': '32567', 'EOYAmt': '23264'}",,,-9303,X,0,0,0,WWW.PLAYFLAGFOOTBALL.COM,300,0,0,250,0,0,250,0,0,0,0,8092,8092,-7842,0,1,1,X,0,0,0,0,0,,1075,,,,,,,,,,,,,215130,196111,,,X,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


# ENDED HERE 2/24/2020

In [81]:
df['ReturnHeader'].sample(1)

743750    {'@binaryAttachmentCnt': '0', 'ReturnTs': '2015-04-10T15:38:15-05:00', 'TaxPeriodEndDt': '2014-10-31', 'PreparerFirmGrp': {'PreparerFirmEIN': '990303190', 'PreparerFirmName': {'BusinessNameLine1': 'CARBONARO CPAS & MANAGEMENT GROUP'}, 'PreparerUS...
Name: ReturnHeader, dtype: object

# Drop duplicates
There are none in this round.

In [82]:
%%time
duplicateRowsDF = df[df.duplicated(['URL'])]
print("Number of columns:", len(duplicateRowsDF.columns))
print("Number of observations:", len(duplicateRowsDF))
duplicateRowsDF = duplicateRowsDF.sort_values('URL')
duplicateRowsDF[['URL', 'OrganizationName', 'ReturnHeader']][:2]

Number of columns: 323
Number of observations: 0


Unnamed: 0,URL,OrganizationName,ReturnHeader


In [83]:
#df[df['URL'].isin(['https://s3.amazonaws.com/irs-form-990/201100299349300805_public.xml',
#                   'https://s3.amazonaws.com/irs-form-990/201100319349300400_public.xml',
#                  ])][['URL', 'OrganizationName', 'ReturnHeader']]

Unnamed: 0,URL,OrganizationName,ReturnHeader
37314,https://s3.amazonaws.com/irs-form-990/201100299349300805_public.xml,ANGELA ARKELL MITCHELL FOUNDATION FOR LITERATURE INC,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-01-29T15:48:25-06:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'TISDALE CPA'}, 'PreparerFirmUSAddress': {'AddressLine1': '75 JUNCTION ..."
37315,https://s3.amazonaws.com/irs-form-990/201100319349300400_public.xml,PLUMBERS AND PIPEFITTERS LU 495 JAC,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-01-31T09:39:58-05:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'CHARLES E WILLIAMS CPA'}, 'PreparerFirmUSAddress': {'AddressLine1': 'P..."


In [38]:
#df = df.sort_values('URL')

In [39]:
#df[df['URL'].isin(['https://s3.amazonaws.com/irs-form-990/201100299349300805_public.xml',
#                   'https://s3.amazonaws.com/irs-form-990/201100319349300400_public.xml',
#                  ])]

Unnamed: 0,AccountantCompileOrReview,ActivityOrMissionDescription,AddressChange,AllAffiliatesIncluded,AllOtherContributions,AmendedReturn,AnnualDisclosureCoveredPersons,AuditCommittee,BenefitsPaidToMembersCY,BenefitsPaidToMembersPriorYear,CashNonInterestBearing,ChangesToOrganizingDocs,CntrbtnsRprtdFundraisingEvents,CompCurrentOfficersDirectors,CompDisqualPersons,CompensationFromOtherSources,CompensationProcessCEO,CompensationProcessOther,ConflictOfInterestPolicy,ContributionsGrantsCurrentYear,ContributionsGrantsPriorYear,CostOfGoodsSold,CountryLegalDomicile,DLN,DecisionsSubjectToApproval,DelegationOfManagementDuties,DoNotFollowSFAS117,DocumentRetentionPolicy,EIN,ElectionOfBoardMembers,FSAudited,FamilyOrBusinessRelationship,FederalGrantAuditPerformed,FederalGrantAuditRequired,FederatedCampaigns,FeesForServicesAccounting,FeesForServicesInvstMgmntFees,FeesForServicesLegal,FeesForServicesLobbying,FeesForServicesManagement,FeesForServicesOther,FeesForServicesProfFundraising,FollowSFAS117,Form990ProvidedToGoverningBody,FormersListed,FundraisingDirectExpenses,FundraisingEvents,GamingDirectExpenses,GovernmentGrants,GrantsAndSimilarAmntsCY,GrantsAndSimilarAmntsPriorYear,GrossIncomeFundraisingEvents,GrossIncomeGaming,GrossReceipts,GrossSalesOfInventory,GroupExemptionNumber,GroupReturnForAffiliates,InitialReturn,InvestmentInJointVenture,InvestmentIncomeCurrentYear,InvestmentIncomePriorYear,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasis,LoansFromOfficersDirectors,LocalChapters,MaterialDiversionOrMisuse,MembersOrStockholders,MembershipDues,MethodOfAccountingAccrual,MethodOfAccountingCash,MethodOfAccountingOther,MinutesOfCommittees,MinutesOfGoverningBody,MortNotesPyblSecuredInvestProp,NameOfPrincipalOfficerPerson,NbrIndependentVotingMembers,NbrVotingGoverningBodyMembers,NbrVotingMembersGoverningBody,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,NetUnrelatedBusinessTxblIncome,NoListedPersonsCompensated,NoncashContributions,NumberIndependentVotingMembers,NumberIndividualsGT100K,NumberOfContractorsGT100K,OfficerMailingAddress,Organization4947a1,Organization501c,Organization501c3,OrganizationName,OtherEmployeeBenefits,OtherExpensePriorYear,OtherExpensesCurrentYear,OtherLiabilities,OtherRevenueCurrentYear,OtherRevenuePriorYear,OtherSalariesAndWages,OtherWebsite,OwnWebsite,PayrollTaxes,PensionPlanContributions,PoliciesReferenceChapters,ProgramServiceRevenueCY,ProgramServiceRevenuePriorYear,ReconcilationRevenueExpenses,RegularMonitoringEnforcement,RelatedOrganizations,RetainedEarningsEndowmentEtc,ReturnHeader,RevenuesLessExpensesCY,RevenuesLessExpensesPriorYear,SalariesEtcCurrentYear,SalariesEtcPriorYear,SavingsAndTempCashInvestments,SpecialConditionDescription,StateLegalDomicile,StatesWhereCopyOfReturnIsFiled,TaxExemptBondLiabilities,TaxPeriod,TerminatedReturn,TerminationOrContraction,TotalAssets,TotalAssetsBOY,TotalAssetsEOY,TotalCompGT150K,TotalContributions,TotalExpensesCurrentYear,TotalExpensesPriorYear,TotalFunctionalExpenses,TotalFundrsngExpCurrentYear,TotalGrossUBI,TotalLiabilitiesBOY,TotalLiabilitiesEOY,TotalNbrEmployees,TotalNbrVolunteers,TotalOtherCompensation,TotalOtherRevenue,TotalProfFundrsngExpCY,TotalProfFundrsngExpPriorYear,TotalProgramServiceRevenue,TotalReportableCompFrmRltdOrgs,TotalReportableCompFromOrg,TotalRevenue,TotalRevenueCurrentYear,TotalRevenuePriorYear,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,TypeOfOrganizationCorporation,TypeOfOrganizationOther,TypeOfOrganizationTrust,URL,UnsecuredNotesLoansPayable,UponRequest,WebSite,WhistleblowerPolicy,WrittenPolicyOrProcedure,YearFormation,ReconcilationDonatedServices,ReconcilationInvestExpenses,ReconcilationPriorAdjustment,ReconciliationUnrealizedInvest,AccountantCompileOrReviewInd,ActivityOrMissionDesc,AddressChangeInd,AllAffiliatesIncludedInd,AllOtherContributionsAmt,AmendedReturnInd,AnnualDisclosureCoveredPrsnInd,AuditCommitteeInd,CYBenefitsPaidToMembersAmt,CYContributionsGrantsAmt,CYGrantsAndSimilarPaidAmt,CYInvestmentIncomeAmt,CYOtherExpensesAmt,CYOtherRevenueAmt,CYProgramServiceRevenueAmt,CYRevenuesLessExpensesAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalExpensesAmt,CYTotalFundraisingExpenseAmt,CYTotalProfFndrsngExpnsAmt,CYTotalRevenueAmt,CashNonInterestBearingGrp,ChangeToOrgDocumentsInd,CntrctRcvdGreaterThan100KCnt,CompCurrentOfcrDirectorsGrp,CompDisqualPersonsGrp,CompensationFromOtherSrcsInd,CompensationProcessCEOInd,CompensationProcessOtherInd,ConflictOfInterestPolicyInd,ContractTerminationInd,ContriRptFundraisingEventAmt,CostOfGoodsSoldAmt,DecisionsSubjectToApprovaInd,DelegationOfMgmtDutiesInd,DocumentRetentionPolicyInd,DonatedServicesAndUseFcltsAmt,ElectionOfBoardMembersInd,FSAuditedInd,FamilyOrBusinessRlnInd,FederalGrantAuditPerformedInd,FederalGrantAuditRequiredInd,FederatedCampaignsAmt,FeesForServicesAccountingGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForServicesManagementGrp,FeesForServicesOtherGrp,FeesForSrvcInvstMgmntFeesGrp,FinalReturnInd,Form990ProvidedToGvrnBodyInd,FormationYr,FormerOfcrEmployeesListedInd,FundraisingAmt,FundraisingDirectExpensesAmt,FundraisingGrossIncomeAmt,GamingDirectExpensesAmt,GamingGrossIncomeAmt,GoverningBodyVotingMembersCnt,GovernmentGrantsAmt,GrossReceiptsAmt,GrossSalesOfInventoryAmt,GroupExemptionNum,GroupReturnForAffiliatesInd,IndependentVotingMemberCnt,IndivRcvdGreaterThan100KCnt,InitialReturnInd,InvestmentExpenseAmt,InvestmentInJointVentureInd,LandBldgEquipAccumDeprecAmt,LandBldgEquipCostOrOtherBssAmt,LegalDomicileCountryCd,LegalDomicileStateCd,LoansFromOfficersDirectorsGrp,LocalChaptersInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,MembershipDuesAmt,MethodOfAccountingAccrualInd,MethodOfAccountingCashInd,MethodOfAccountingOtherInd,MinutesOfCommitteesInd,MinutesOfGoverningBodyInd,MortgNotesPyblScrdInvstPropGrp,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,NetUnrelatedBusTxblIncmAmt,NetUnrlzdGainsLossesInvstAmt,NoListedPersonsCompensatedInd,NoncashContributionsAmt,OfficerMailingAddressInd,OrgDoesNotFollowSFAS117Ind,Organization4947a1NotPFInd,Organization501c3Ind,Organization501cInd,OrganizationFollowsSFAS117Ind,OtherEmployeeBenefitsGrp,OtherLiabilitiesGrp,OtherOrganizationDsc,OtherRevenueTotalAmt,OtherSalariesAndWagesGrp,OtherWebsiteInd,OwnWebsiteInd,PYBenefitsPaidToMembersAmt,PYContributionsGrantsAmt,PYGrantsAndSimilarPaidAmt,PYInvestmentIncomeAmt,PYOtherExpensesAmt,PYOtherRevenueAmt,PYProgramServiceRevenueAmt,PYRevenuesLessExpensesAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalExpensesAmt,PYTotalProfFndrsngExpnsAmt,PYTotalRevenueAmt,PayrollTaxesGrp,PensionPlanContributionsGrp,PoliciesReferenceChaptersInd,PrincipalOfficerNm,PriorPeriodAdjustmentsAmt,ReconcilationRevenueExpnssAmt,RegularMonitoringEnfrcInd,RelatedOrganizationsAmt,RtnEarnEndowmentIncmOthFndsGrp,SavingsAndTempCashInvstGrp,SpecialConditionDesc,StatesWhereCopyOfReturnIsFldCd,TaxExemptBondLiabilitiesGrp,TotReportableCompRltdOrgAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalAssetsGrp,TotalCompGreaterThan150KInd,TotalContributionsAmt,TotalEmployeeCnt,TotalFunctionalExpensesGrp,TotalGrossUBIAmt,TotalLiabilitiesBOYAmt,TotalLiabilitiesEOYAmt,TotalOtherCompensationAmt,TotalProgramServiceRevenueAmt,TotalReportableCompFromOrgAmt,TotalRevenueGrp,TotalVolunteersCnt,TypeOfOrganizationAssocInd,TypeOfOrganizationCorpInd,TypeOfOrganizationOtherInd,TypeOfOrganizationTrustInd,UnsecuredNotesLoansPayableGrp,UponRequestInd,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,WebsiteAddressTxt,WhistleblowerPolicyInd,WrittenPolicyOrProcedureInd
7314,0,PROMOTE THE APPRECIATION OF LITERATURE,,,23238,,,,0,0.0,,0,,,,0,0,0,0,23238,20339,,,93493029008051,0,0,,0,43352779,0,0,0,,0,,"{'Total': '800', 'ManagementAndGeneral': '800'}",,,,,"{'Total': '229', 'ProgramServices': '229'}",,X,1,0,,,,,0,0.0,,,25359,,,0,,0,51,51,547,2000,,0,0,0,,,X,,1,1,,ROBERT MITCHELL JR,5,6,6,103817,109727,0,X,,5,0,0.0,0,,,X,ANGELA ARKELL MITCHELL FOUNDATION FOR LITERATURE INC,,22119,19449,,0,0.0,,X,,,,,2070,1860,5910,,,,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",5910,131,0,0,"{'BOY': '103817', 'EOY': '108274'}",,MA,MA,,201012,,,"{'BOY': '103817', 'EOY': '109727'}",103817,109727,0,23238,19449,22119,"{'Total': '19449', 'ProgramServices': '12789', 'ManagementAndGeneral': '6660', 'Fundraising': '0'}",0,0,0.0,0,0,0.0,0,,0,0.0,2070,0,0,"{'TotalRevenueColumn': '25359', 'RelatedOrExemptFunctionIncome': '2070', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '51'}",25359,22250,,,X,,,https://s3.amazonaws.com/irs-form-990/201100299349300805_public.xml,,X,WWW.CONCORDFESTIVALOFAUTHORS.COM,0,,1997,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
7315,true,TRAIN APPRENTICES OF PLUMBERS LOCAL UNION 495,,,230914,,True,True,0,,,false,,,,false,false,false,true,230914,106466,,,93493031004001,false,false,,true,310907387,false,false,false,,false,,"{'Total': '2675', 'ProgramServices': '2675'}",,,,,,,X,true,false,,,,,0,,,,694025,,,false,,false,8106,10066,95383,95383,,false,false,false,,,X,,true,true,,,4,6,6,766665,706570,0,,,4,0,,false,,,X,PLUMBERS AND PIPEFITTERS LU 495 JAC,"{'Total': '75124', 'ProgramServices': '75124'}",263455,350664,,0,,"{'Total': '242271', 'ProgramServices': '242271'}",,,"{'Total': '23932', 'ProgramServices': '23932'}","{'Total': '62129', 'ProgramServices': '62129'}",,455005,652026,-60095,True,,,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",-60095,87695,403456,417408,"{'BOY': '766665', 'EOY': '711169'}",,OH,,,201012,,,"{'BOY': '766665', 'EOY': '711169'}",766665,711169,false,230914,754120,680863,"{'Total': '754120', 'ProgramServices': '754120', 'ManagementAndGeneral': '0', 'Fundraising': '0'}",0,0,,4599,31,,0,,0,,455005,186642,7677,"{'TotalRevenueColumn': '694025', 'RelatedOrExemptFunctionIncome': '463111', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '0'}",694025,768558,,X,,,,https://s3.amazonaws.com/irs-form-990/201100319349300400_public.xml,,X,,true,,1936,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [40]:
#dfo = df.drop_duplicates(subset='URL')
#print(len(df),len(dfo))
#dfo[:1]

1727056 1727056


Unnamed: 0,AccountantCompileOrReview,ActivityOrMissionDescription,AddressChange,AllAffiliatesIncluded,AllOtherContributions,AmendedReturn,AnnualDisclosureCoveredPersons,AuditCommittee,BenefitsPaidToMembersCY,BenefitsPaidToMembersPriorYear,CashNonInterestBearing,ChangesToOrganizingDocs,CntrbtnsRprtdFundraisingEvents,CompCurrentOfficersDirectors,CompDisqualPersons,CompensationFromOtherSources,CompensationProcessCEO,CompensationProcessOther,ConflictOfInterestPolicy,ContributionsGrantsCurrentYear,ContributionsGrantsPriorYear,CostOfGoodsSold,CountryLegalDomicile,DLN,DecisionsSubjectToApproval,DelegationOfManagementDuties,DoNotFollowSFAS117,DocumentRetentionPolicy,EIN,ElectionOfBoardMembers,FSAudited,FamilyOrBusinessRelationship,FederalGrantAuditPerformed,FederalGrantAuditRequired,FederatedCampaigns,FeesForServicesAccounting,FeesForServicesInvstMgmntFees,FeesForServicesLegal,FeesForServicesLobbying,FeesForServicesManagement,FeesForServicesOther,FeesForServicesProfFundraising,FollowSFAS117,Form990ProvidedToGoverningBody,FormersListed,FundraisingDirectExpenses,FundraisingEvents,GamingDirectExpenses,GovernmentGrants,GrantsAndSimilarAmntsCY,GrantsAndSimilarAmntsPriorYear,GrossIncomeFundraisingEvents,GrossIncomeGaming,GrossReceipts,GrossSalesOfInventory,GroupExemptionNumber,GroupReturnForAffiliates,InitialReturn,InvestmentInJointVenture,InvestmentIncomeCurrentYear,InvestmentIncomePriorYear,LandBldgEquipmentAccumDeprec,LandBuildingsEquipmentBasis,LoansFromOfficersDirectors,LocalChapters,MaterialDiversionOrMisuse,MembersOrStockholders,MembershipDues,MethodOfAccountingAccrual,MethodOfAccountingCash,MethodOfAccountingOther,MinutesOfCommittees,MinutesOfGoverningBody,MortNotesPyblSecuredInvestProp,NameOfPrincipalOfficerPerson,NbrIndependentVotingMembers,NbrVotingGoverningBodyMembers,NbrVotingMembersGoverningBody,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,NetUnrelatedBusinessTxblIncome,NoListedPersonsCompensated,NoncashContributions,NumberIndependentVotingMembers,NumberIndividualsGT100K,NumberOfContractorsGT100K,OfficerMailingAddress,Organization4947a1,Organization501c,Organization501c3,OrganizationName,OtherEmployeeBenefits,OtherExpensePriorYear,OtherExpensesCurrentYear,OtherLiabilities,OtherRevenueCurrentYear,OtherRevenuePriorYear,OtherSalariesAndWages,OtherWebsite,OwnWebsite,PayrollTaxes,PensionPlanContributions,PoliciesReferenceChapters,ProgramServiceRevenueCY,ProgramServiceRevenuePriorYear,ReconcilationRevenueExpenses,RegularMonitoringEnforcement,RelatedOrganizations,RetainedEarningsEndowmentEtc,ReturnHeader,RevenuesLessExpensesCY,RevenuesLessExpensesPriorYear,SalariesEtcCurrentYear,SalariesEtcPriorYear,SavingsAndTempCashInvestments,SpecialConditionDescription,StateLegalDomicile,StatesWhereCopyOfReturnIsFiled,TaxExemptBondLiabilities,TaxPeriod,TerminatedReturn,TerminationOrContraction,TotalAssets,TotalAssetsBOY,TotalAssetsEOY,TotalCompGT150K,TotalContributions,TotalExpensesCurrentYear,TotalExpensesPriorYear,TotalFunctionalExpenses,TotalFundrsngExpCurrentYear,TotalGrossUBI,TotalLiabilitiesBOY,TotalLiabilitiesEOY,TotalNbrEmployees,TotalNbrVolunteers,TotalOtherCompensation,TotalOtherRevenue,TotalProfFundrsngExpCY,TotalProfFundrsngExpPriorYear,TotalProgramServiceRevenue,TotalReportableCompFrmRltdOrgs,TotalReportableCompFromOrg,TotalRevenue,TotalRevenueCurrentYear,TotalRevenuePriorYear,TypeOfOrgOtherDescription,TypeOfOrganizationAssociation,TypeOfOrganizationCorporation,TypeOfOrganizationOther,TypeOfOrganizationTrust,URL,UnsecuredNotesLoansPayable,UponRequest,WebSite,WhistleblowerPolicy,WrittenPolicyOrProcedure,YearFormation,ReconcilationDonatedServices,ReconcilationInvestExpenses,ReconcilationPriorAdjustment,ReconciliationUnrealizedInvest,AccountantCompileOrReviewInd,ActivityOrMissionDesc,AddressChangeInd,AllAffiliatesIncludedInd,AllOtherContributionsAmt,AmendedReturnInd,AnnualDisclosureCoveredPrsnInd,AuditCommitteeInd,CYBenefitsPaidToMembersAmt,CYContributionsGrantsAmt,CYGrantsAndSimilarPaidAmt,CYInvestmentIncomeAmt,CYOtherExpensesAmt,CYOtherRevenueAmt,CYProgramServiceRevenueAmt,CYRevenuesLessExpensesAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalExpensesAmt,CYTotalFundraisingExpenseAmt,CYTotalProfFndrsngExpnsAmt,CYTotalRevenueAmt,CashNonInterestBearingGrp,ChangeToOrgDocumentsInd,CntrctRcvdGreaterThan100KCnt,CompCurrentOfcrDirectorsGrp,CompDisqualPersonsGrp,CompensationFromOtherSrcsInd,CompensationProcessCEOInd,CompensationProcessOtherInd,ConflictOfInterestPolicyInd,ContractTerminationInd,ContriRptFundraisingEventAmt,CostOfGoodsSoldAmt,DecisionsSubjectToApprovaInd,DelegationOfMgmtDutiesInd,DocumentRetentionPolicyInd,DonatedServicesAndUseFcltsAmt,ElectionOfBoardMembersInd,FSAuditedInd,FamilyOrBusinessRlnInd,FederalGrantAuditPerformedInd,FederalGrantAuditRequiredInd,FederatedCampaignsAmt,FeesForServicesAccountingGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForServicesManagementGrp,FeesForServicesOtherGrp,FeesForSrvcInvstMgmntFeesGrp,FinalReturnInd,Form990ProvidedToGvrnBodyInd,FormationYr,FormerOfcrEmployeesListedInd,FundraisingAmt,FundraisingDirectExpensesAmt,FundraisingGrossIncomeAmt,GamingDirectExpensesAmt,GamingGrossIncomeAmt,GoverningBodyVotingMembersCnt,GovernmentGrantsAmt,GrossReceiptsAmt,GrossSalesOfInventoryAmt,GroupExemptionNum,GroupReturnForAffiliatesInd,IndependentVotingMemberCnt,IndivRcvdGreaterThan100KCnt,InitialReturnInd,InvestmentExpenseAmt,InvestmentInJointVentureInd,LandBldgEquipAccumDeprecAmt,LandBldgEquipCostOrOtherBssAmt,LegalDomicileCountryCd,LegalDomicileStateCd,LoansFromOfficersDirectorsGrp,LocalChaptersInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,MembershipDuesAmt,MethodOfAccountingAccrualInd,MethodOfAccountingCashInd,MethodOfAccountingOtherInd,MinutesOfCommitteesInd,MinutesOfGoverningBodyInd,MortgNotesPyblScrdInvstPropGrp,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,NetUnrelatedBusTxblIncmAmt,NetUnrlzdGainsLossesInvstAmt,NoListedPersonsCompensatedInd,NoncashContributionsAmt,OfficerMailingAddressInd,OrgDoesNotFollowSFAS117Ind,Organization4947a1NotPFInd,Organization501c3Ind,Organization501cInd,OrganizationFollowsSFAS117Ind,OtherEmployeeBenefitsGrp,OtherLiabilitiesGrp,OtherOrganizationDsc,OtherRevenueTotalAmt,OtherSalariesAndWagesGrp,OtherWebsiteInd,OwnWebsiteInd,PYBenefitsPaidToMembersAmt,PYContributionsGrantsAmt,PYGrantsAndSimilarPaidAmt,PYInvestmentIncomeAmt,PYOtherExpensesAmt,PYOtherRevenueAmt,PYProgramServiceRevenueAmt,PYRevenuesLessExpensesAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalExpensesAmt,PYTotalProfFndrsngExpnsAmt,PYTotalRevenueAmt,PayrollTaxesGrp,PensionPlanContributionsGrp,PoliciesReferenceChaptersInd,PrincipalOfficerNm,PriorPeriodAdjustmentsAmt,ReconcilationRevenueExpnssAmt,RegularMonitoringEnfrcInd,RelatedOrganizationsAmt,RtnEarnEndowmentIncmOthFndsGrp,SavingsAndTempCashInvstGrp,SpecialConditionDesc,StatesWhereCopyOfReturnIsFldCd,TaxExemptBondLiabilitiesGrp,TotReportableCompRltdOrgAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalAssetsGrp,TotalCompGreaterThan150KInd,TotalContributionsAmt,TotalEmployeeCnt,TotalFunctionalExpensesGrp,TotalGrossUBIAmt,TotalLiabilitiesBOYAmt,TotalLiabilitiesEOYAmt,TotalOtherCompensationAmt,TotalProgramServiceRevenueAmt,TotalReportableCompFromOrgAmt,TotalRevenueGrp,TotalVolunteersCnt,TypeOfOrganizationAssocInd,TypeOfOrganizationCorpInd,TypeOfOrganizationOtherInd,TypeOfOrganizationTrustInd,UnsecuredNotesLoansPayableGrp,UponRequestInd,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,WebsiteAddressTxt,WhistleblowerPolicyInd,WrittenPolicyOrProcedureInd
0,0,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,X,,1439340,,1,1,0,0,,0,,,,0,0,0,1,1439340,1044925,,,93493313013011,0,0,,0,232705170,0,1,0,,0,,"{'Total': '21675', 'ManagementAndGeneral': '21675'}",,"{'Total': '215', 'ManagementAndGeneral': '215'}",,,,,X,1,0,,,,,925000,638637,,,1473903,,,0,,0,33563,30447,86228,256845,,0,0,0,,X,,,1,1,,MICHAEL ANTON,10,10,10,1753405,1990429,0,X,,10,0,0,0,,,X,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,,243131,459751,"{'BOY': '51640', 'EOY': '240077'}",1000,0,,,,,,,0,0,89152,1,,,"{'TaxPeriodEndDate': '2010-12-31', 'TaxYear': '2010'}",89152,193604,0,0,"{'BOY': '332660', 'EOY': '270700'}",,PA,"[PA, NJ, DE]",,201012,,,"{'BOY': '1925215', 'EOY': '2440859'}",1925215,2440859,0,1439340,1384751,881768,"{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}",195892,0,171810,450430,0,0,0,1000,0,0,,0,0,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}",1473903,1075372,,,X,,,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,,X,,0,,1992,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


<br>39,729 990 filings are not in the database

In [72]:
#1587557 - len(dfo)

39729

<br>70,092 duplicates were dropped
<br>UPDATE: none dropped

### Create *Fiscal Year* Variable

In [84]:
[c for c in df.columns.tolist() if 'Tax' in c]

['TaxPeriod',
 'PayrollTaxes',
 'TaxExemptBondLiabilities',
 'PayrollTaxesGrp',
 'TaxExemptBondLiabilitiesGrp']

In [85]:
print(len(df[df['TaxPeriod'].notnull()]))

1895015


In [86]:
print(len(df[df['TaxPeriod'].isnull()]))

1


In [87]:
df['TaxPeriod'].value_counts().head()

201712    138562
201812    136698
201612    131298
201512    124527
201412    115139
Name: TaxPeriod, dtype: int64

In [88]:
df['TaxPeriod'].dtype

dtype('O')

In [89]:
df['fiscal_year'] = df['TaxPeriod'].str[:4]
df['fiscal_year'].value_counts()

2017    251118
2018    250237
2016    240291
2015    228000
2014    210538
2013    190710
2012    170761
2011    139300
2019    113354
2010     98185
2020      2520
2108         1
Name: fiscal_year, dtype: int64

In [90]:
years = pd.DataFrame(df['fiscal_year'].value_counts())
years.index.name = 'year'
years = years.reset_index()
years = years.sort_values('year')
years

Unnamed: 0,year,fiscal_year
9,2010,98185
7,2011,139300
6,2012,170761
5,2013,190710
4,2014,210538
3,2015,228000
2,2016,240291
0,2017,251118
1,2018,250237
8,2019,113354


In [91]:
print("Number of columns:", len(df.columns))
print("Number of observations:", len(df))
df[:1]    

Number of columns: 324
Number of observations: 1895016


Unnamed: 0,OrganizationName,URL,DLN,TaxPeriod,EIN,ReturnHeader,AddressChange,NameOfPrincipalOfficerPerson,GrossReceipts,GroupReturnForAffiliates,Organization501c3,WebSite,TypeOfOrganizationCorporation,YearFormation,StateLegalDomicile,ActivityOrMissionDescription,NbrVotingMembersGoverningBody,NbrIndependentVotingMembers,TotalNbrEmployees,TotalNbrVolunteers,TotalGrossUBI,NetUnrelatedBusinessTxblIncome,ContributionsGrantsPriorYear,ContributionsGrantsCurrentYear,ProgramServiceRevenuePriorYear,ProgramServiceRevenueCY,InvestmentIncomePriorYear,InvestmentIncomeCurrentYear,OtherRevenuePriorYear,OtherRevenueCurrentYear,TotalRevenuePriorYear,TotalRevenueCurrentYear,GrantsAndSimilarAmntsPriorYear,GrantsAndSimilarAmntsCY,BenefitsPaidToMembersPriorYear,BenefitsPaidToMembersCY,SalariesEtcPriorYear,SalariesEtcCurrentYear,TotalProfFundrsngExpPriorYear,TotalProfFundrsngExpCY,TotalFundrsngExpCurrentYear,OtherExpensePriorYear,OtherExpensesCurrentYear,TotalExpensesPriorYear,TotalExpensesCurrentYear,RevenuesLessExpensesPriorYear,RevenuesLessExpensesCY,TotalAssetsBOY,TotalAssetsEOY,TotalLiabilitiesBOY,TotalLiabilitiesEOY,NetAssetsOrFundBalancesBOY,NetAssetsOrFundBalancesEOY,ProfessionalFundraising,FundraisingActivities,Gaming,NbrVotingGoverningBodyMembers,NumberIndependentVotingMembers,FamilyOrBusinessRelationship,DelegationOfManagementDuties,ChangesToOrganizingDocs,MaterialDiversionOrMisuse,MembersOrStockholders,ElectionOfBoardMembers,DecisionsSubjectToApproval,MinutesOfGoverningBody,MinutesOfCommittees,OfficerMailingAddress,LocalChapters,Form990ProvidedToGoverningBody,ConflictOfInterestPolicy,AnnualDisclosureCoveredPersons,RegularMonitoringEnforcement,WhistleblowerPolicy,DocumentRetentionPolicy,CompensationProcessCEO,CompensationProcessOther,InvestmentInJointVenture,StatesWhereCopyOfReturnIsFiled,UponRequest,NoListedPersonsCompensated,TotalReportableCompFromOrg,TotalReportableCompFrmRltdOrgs,TotalOtherCompensation,NumberIndividualsGT100K,FormersListed,TotalCompGT150K,CompensationFromOtherSources,NumberOfContractorsGT100K,AllOtherContributions,TotalContributions,TotalOtherRevenue,TotalRevenue,FeesForServicesLegal,FeesForServicesAccounting,TotalFunctionalExpenses,SavingsAndTempCashInvestments,LandBuildingsEquipmentBasis,LandBldgEquipmentAccumDeprec,TotalAssets,OtherLiabilities,FollowSFAS117,ReconcilationRevenueExpenses,MethodOfAccountingAccrual,AccountantCompileOrReview,FSAudited,AuditCommittee,FederalGrantAuditRequired,AllAffiliatesIncluded,GroupExemptionNumber,PoliciesReferenceChapters,WrittenPolicyOrProcedure,TotalProgramServiceRevenue,CompCurrentOfficersDirectors,CompDisqualPersons,OtherSalariesAndWages,PensionPlanContributions,OtherEmployeeBenefits,PayrollTaxes,FeesForServicesManagement,FeesForServicesLobbying,FeesForServicesProfFundraising,FeesForServicesInvstMgmntFees,FeesForServicesOther,CashNonInterestBearing,MortNotesPyblSecuredInvestProp,FederalGrantAuditPerformed,LoansFromOfficersDirectors,MethodOfAccountingCash,TaxExemptBondLiabilities,OtherWebsite,FundraisingEvents,CntrbtnsRprtdFundraisingEvents,RelatedOrganizations,GrossIncomeFundraisingEvents,FundraisingDirectExpenses,FederatedCampaigns,GovernmentGrants,MethodOfAccountingOther,GrossSalesOfInventory,CostOfGoodsSold,DoNotFollowSFAS117,RetainedEarningsEndowmentEtc,InitialReturn,MembershipDues,GrossIncomeGaming,GamingDirectExpenses,NoncashContributions,OwnWebsite,UnsecuredNotesLoansPayable,TypeOfOrganizationOther,Organization501c,TypeOfOrganizationTrust,TypeOfOrganizationAssociation,CountryLegalDomicile,AmendedReturn,TypeOfOrgOtherDescription,TerminatedReturn,TerminationOrContraction,SpecialConditionDescription,Organization4947a1,ReconciliationUnrealizedInvest,ReconcilationPriorAdjustment,ReconcilationDonatedServices,ReconcilationInvestExpenses,PrincipalOfficerNm,GrossReceiptsAmt,GroupReturnForAffiliatesInd,Organization501c3Ind,TypeOfOrganizationCorpInd,FormationYr,LegalDomicileStateCd,ActivityOrMissionDesc,VotingMembersGoverningBodyCnt,VotingMembersIndependentCnt,TotalEmployeeCnt,TotalGrossUBIAmt,CYContributionsGrantsAmt,CYProgramServiceRevenueAmt,CYInvestmentIncomeAmt,CYOtherRevenueAmt,CYTotalRevenueAmt,CYGrantsAndSimilarPaidAmt,CYBenefitsPaidToMembersAmt,CYSalariesCompEmpBnftPaidAmt,CYTotalProfFndrsngExpnsAmt,CYTotalFundraisingExpenseAmt,CYOtherExpensesAmt,CYTotalExpensesAmt,CYRevenuesLessExpensesAmt,TotalAssetsBOYAmt,TotalAssetsEOYAmt,TotalLiabilitiesEOYAmt,NetAssetsOrFundBalancesBOYAmt,NetAssetsOrFundBalancesEOYAmt,ProfessionalFundraisingInd,FundraisingActivitiesInd,GamingActivitiesInd,GoverningBodyVotingMembersCnt,IndependentVotingMemberCnt,FamilyOrBusinessRlnInd,DelegationOfMgmtDutiesInd,ChangeToOrgDocumentsInd,MaterialDiversionOrMisuseInd,MembersOrStockholdersInd,ElectionOfBoardMembersInd,DecisionsSubjectToApprovaInd,MinutesOfGoverningBodyInd,MinutesOfCommitteesInd,OfficerMailingAddressInd,LocalChaptersInd,Form990ProvidedToGvrnBodyInd,ConflictOfInterestPolicyInd,WhistleblowerPolicyInd,DocumentRetentionPolicyInd,CompensationProcessCEOInd,CompensationProcessOtherInd,InvestmentInJointVentureInd,StatesWhereCopyOfReturnIsFldCd,NoListedPersonsCompensatedInd,FormerOfcrEmployeesListedInd,TotalCompGreaterThan150KInd,CompensationFromOtherSrcsInd,MembershipDuesAmt,FundraisingAmt,AllOtherContributionsAmt,TotalContributionsAmt,OtherRevenueTotalAmt,TotalRevenueGrp,FeesForServicesAccountingGrp,TotalFunctionalExpensesGrp,CashNonInterestBearingGrp,TotalAssetsGrp,OrgDoesNotFollowSFAS117Ind,RtnEarnEndowmentIncmOthFndsGrp,ReconcilationRevenueExpnssAmt,MethodOfAccountingCashInd,AccountantCompileOrReviewInd,FSAuditedInd,FederalGrantAuditRequiredInd,WebsiteAddressTxt,TotalVolunteersCnt,NetUnrelatedBusTxblIncmAmt,PYContributionsGrantsAmt,PYProgramServiceRevenueAmt,PYInvestmentIncomeAmt,PYOtherRevenueAmt,PYTotalRevenueAmt,PYGrantsAndSimilarPaidAmt,PYBenefitsPaidToMembersAmt,PYSalariesCompEmpBnftPaidAmt,PYTotalProfFndrsngExpnsAmt,PYOtherExpensesAmt,PYTotalExpensesAmt,PYRevenuesLessExpensesAmt,TotalLiabilitiesBOYAmt,AnnualDisclosureCoveredPrsnInd,RegularMonitoringEnfrcInd,UponRequestInd,TotalReportableCompFromOrgAmt,TotReportableCompRltdOrgAmt,TotalOtherCompensationAmt,IndivRcvdGreaterThan100KCnt,CntrctRcvdGreaterThan100KCnt,GovernmentGrantsAmt,TotalProgramServiceRevenueAmt,FundraisingGrossIncomeAmt,ContriRptFundraisingEventAmt,FundraisingDirectExpensesAmt,GrossSalesOfInventoryAmt,CostOfGoodsSoldAmt,CompCurrentOfcrDirectorsGrp,OtherSalariesAndWagesGrp,PensionPlanContributionsGrp,OtherEmployeeBenefitsGrp,PayrollTaxesGrp,FeesForServicesOtherGrp,SavingsAndTempCashInvstGrp,LandBldgEquipCostOrOtherBssAmt,LandBldgEquipAccumDeprecAmt,MortgNotesPyblScrdInvstPropGrp,OtherLiabilitiesGrp,OrganizationFollowsSFAS117Ind,NetUnrlzdGainsLossesInvstAmt,AuditCommitteeInd,AllAffiliatesIncludedInd,CompDisqualPersonsGrp,FeesForServicesManagementGrp,FeesForServicesLegalGrp,FeesForServicesLobbyingGrp,FeesForSrvcInvstMgmntFeesGrp,MethodOfAccountingAccrualInd,NoncashContributionsAmt,TaxExemptBondLiabilitiesGrp,LoansFromOfficersDirectorsGrp,UnsecuredNotesLoansPayableGrp,PriorPeriodAdjustmentsAmt,FederalGrantAuditPerformedInd,PoliciesReferenceChaptersInd,OtherWebsiteInd,AddressChangeInd,WrittenPolicyOrProcedureInd,RelatedOrganizationsAmt,OwnWebsiteInd,DonatedServicesAndUseFcltsAmt,LegalDomicileCountryCd,TypeOfOrganizationTrustInd,FinalReturnInd,ContractTerminationInd,GroupExemptionNum,FederatedCampaignsAmt,TypeOfOrganizationOtherInd,OtherOrganizationDsc,TypeOfOrganizationAssocInd,InitialReturnInd,GamingGrossIncomeAmt,GamingDirectExpensesAmt,MethodOfAccountingOtherInd,InvestmentExpenseAmt,Organization501cInd,Organization4947a1NotPFInd,AmendedReturnInd,SpecialConditionDesc,fiscal_year
0,RONALD MCDONALD HOUSE CHARITIES- PHILADELPHIA REGION INC,https://s3.amazonaws.com/irs-form-990/201113139349301301_public.xml,93493313013011,201012,232705170,"{'@binaryAttachmentCount': '0', 'Timestamp': '2011-11-09T06:41:09-06:00', 'TaxPeriodEndDate': '2010-12-31', 'PreparerFirm': {'PreparerFirmBusinessName': {'BusinessNameLine1': 'CONCANNON MILLER & CO PC'}, 'PreparerFirmUSAddress': {'AddressLine1': ...",X,MICHAEL ANTON,1473903,0,X,,X,1992,PA,MAKES GRANTS TO NON-PROFITS THAT DIRECTLY IMPROVE THE HEALTH AND WELL-BEING OF CHILDREN.,10,10,0,0,0,0,1044925,1439340,0,0,30447,33563,0,1000,1075372,1473903,638637,925000,0,0,0,0,0,0,195892,243131,459751,881768,1384751,193604,89152,1925215,2440859,171810,450430,1753405,1990429,0,0,0,10,10,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,0,0,0,0,0,"[PA, NJ, DE]",X,X,0,0,0,0,0,0,0,0,1439340,1439340,1000,"{'TotalRevenueColumn': '1473903', 'RelatedOrExemptFunctionIncome': '1000', 'UnrelatedBusinessRevenue': '0', 'ExclusionAmount': '33563'}","{'Total': '215', 'ManagementAndGeneral': '215'}","{'Total': '21675', 'ManagementAndGeneral': '21675'}","{'Total': '1384751', 'ProgramServices': '1043744', 'ManagementAndGeneral': '145115', 'Fundraising': '195892'}","{'BOY': '332660', 'EOY': '270700'}",256845,86228,"{'BOY': '1925215', 'EOY': '2440859'}","{'BOY': '51640', 'EOY': '240077'}",X,89152,X,0,1,1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2010


### Initital Verifications - Check whether it contains all relevant columns

In [92]:
id_cols = ['DLN', 'EIN', 'URL', 'OrganizationName', 'ReturnHeader']

In [93]:
print(len(set(df.columns.tolist())) - len(set(id_cols)))
print(len(set(df.columns.tolist())) - len(set(mongo_cols)) - len(set(id_cols)))

319
-9


In [94]:
set(mongo_cols) - set(df.columns.tolist())

{'BuildTS',
 'BusinessOfficerGrp',
 'Filer',
 'Officer',
 'ReturnTs',
 'TaxPeriodEndDate',
 'TaxPeriodEndDt',
 'TaxYear',
 'TaxYr',
 'Timestamp'}

In [95]:
set(df.columns.tolist()) - set(mongo_cols)

{'DLN', 'EIN', 'OrganizationName', 'ReturnHeader', 'URL', 'fiscal_year'}

In [96]:
len(df)

1895016

In [97]:
len(df.columns)

324

### Save DF

In [98]:
pwd

'C:\\Users\\Gregory\\IRS 990 Control Variables'

In [99]:
cd "C:\\Users\\Gregory\\IRS 990 Control Variables\\"

C:\Users\Gregory\IRS 990 Control Variables


In [100]:
%%time
df.to_pickle('all filings nov. 2020 - all control variables.pkl.gz', compression='gzip')

Wall time: 26min 25s
