## Overview
This is the first in a series of tutorials that illustrate how to download the 990 e-file data the IRS started to make public in 2017. The IRS 990 e-file data are housed on Amazon Web Services (AWS) at https://aws.amazon.com/public-data-sets/irs-990/

In this first notebook we will access the AWS data and download the annual index files that list all available 990 filings. Specifically, we will download the index files and insert them into a MongoDB database. 

Note that in 2022 the IRS created a different process for making the electronic 990 forms available. I will create a series of tutorials for the new process in the near future. The following code will allow you to access all e-file data made available up to December 31, 2021.

### Load packages and set working directory

In [2]:
import sys
import time
import json

#### Show current date and time to track latest time we've used the code

In [4]:
%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

Wall time: 0 ns
Current date and time :  2022-06-19 15:46:36 



#### MongoDB
Depending on the project, I store the data in SQLite or MongoDB. This time I'll use MongoDB, which is well suited to storing JSON data where each observation could have different variables. Before we get to the interesting part, the following code blocks set up the MongoDB environment and the new database we'll be using.

**_Note:_** In a terminal we'll have to start MongoDB by running the command *mongod* or *sudo mongod*. Then we run the following code block here to access MongoDB.

In [5]:
import pymongo
from pymongo import MongoClient
client = MongoClient()

<br>
We first have to define a database and then a table, or *collection*, for storing the file listing information we will download. I decided it would be better to have a separate index *collection* in the database for each year.

In [29]:
# DEFINE THE MONGODB DATABASE
db = client['irs_990_db']

# DEFINE THE COLLECTIONS WHERE I'LL INSERT THE DATA
file_list_2011 = db['file_list_2011']
file_list_2012 = db['file_list_2012']
file_list_2013 = db['file_list_2013']
file_list_2014 = db['file_list_2014']
file_list_2015 = db['file_list_2015']
file_list_2016 = db['file_list_2016']
file_list_2017 = db['file_list_2017']
file_list_2018 = db['file_list_2018']
file_list_2019 = db['file_list_2019']
file_list_2020 = db['file_list_2020']
file_list_2021 = db['file_list_2021']

<br>Check how many observations are in each collection. The values will be zero until we add data.

In [31]:
print(file_list_2011.estimated_document_count())
print(file_list_2012.estimated_document_count())
print(file_list_2013.estimated_document_count())
print(file_list_2014.estimated_document_count())
print(file_list_2015.estimated_document_count())
print(file_list_2016.estimated_document_count())
print(file_list_2017.estimated_document_count())
print(file_list_2018.estimated_document_count())
print(file_list_2019.estimated_document_count())
print(file_list_2020.estimated_document_count())
print(file_list_2021.estimated_document_count())

0
0
0
0
0
0
0
0
0
0
0


# Download index data

Let's first create a python *list* containing all index file years.

In [32]:
year_list = []
for year in range(2011, 2022, 1):
    year_list.append(year)
print(year_list)

[2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
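As a small aside, the same list can be built in one line. This is only a sketch of an equivalent spelling; the point to remember is that the stop value of `range` is exclusive, so `2022` is needed to include 2021.

```python
# range(start, stop) stops *before* `stop`, so 2022 yields 2011..2021
year_list = list(range(2011, 2022))
print(year_list[0], year_list[-1], len(year_list))
```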


<br>We will now create a Python *dictionary* (called *data*) that will hold each year's index of filings. In the following block of code we begin a *for loop*, looping over each year in our list to access the respective annual key in our dictionary in turn.

If you are unfamiliar with Python, each year has its own *key* in the *data* dictionary; for instance, the 2011 data will be nested under the <code>data['Filings2011']</code> key. We use the *year* values in the Python list we have just created to access each dictionary key in turn. The trick is the <code>%s</code> string formatting placeholder. The <code>%s</code> marks where a value will be substituted, and the <code>year</code> following the percent sign tells the code which value to use. Within the context of a <code>for loop</code>, we thus have code that will create each of our keys in turn.

In the final line of code here we access the key in each loop and print out the number of filings for that year. This process will take a few minutes as each index file is downloaded in turn. 
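To make the placeholder mechanics concrete, here is a minimal sketch that builds the URL and the dictionary key for a single year without downloading anything. The f-string form is shown only as an equivalent alternative; the notebook itself uses `%s`.

```python
year = 2011

# %-formatting: the value after the % operator replaces the %s placeholder
url = 'https://s3.amazonaws.com/irs-form-990/index_%s.json' % year
key = 'Filings%s' % year

# Equivalent f-string spelling (Python 3.6+)
key_f = f'Filings{year}'

print(url)
print(key, key == key_f)
```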

In [33]:
import requests
data = {}
for year in year_list:
    url = 'https://s3.amazonaws.com/irs-form-990/index_%s.json' % year
    f = requests.get(url)
    data['Filings%s' % year] = f.json()['Filings%s' % year]
    print('# of filings in', year, ':', len(data['Filings%s' % year]))    
print('# of years of data:', len(data))

# of filings in 2011 : 203075
# of filings in 2012 : 261622
# of filings in 2013 : 261449
# of filings in 2014 : 387529
# of filings in 2015 : 261034
# of filings in 2016 : 378420
# of filings in 2017 : 489013
# of filings in 2018 : 457510
# of filings in 2019 : 416910
# of filings in 2020 : 333722
# of filings in 2021 : 461887
# of years of data: 11
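The download loop assumes that each index file contains a top-level key named after its year. As a hedged sketch, the lookup can be wrapped in a small helper that fails with a clearer message when that key is missing. The function name `extract_filings` and the sample payload are hypothetical and are not part of the original notebook; the sample is hand-made, not real IRS data.

```python
import json

def extract_filings(payload, year):
    """Return the list stored under the 'Filings<year>' key of a parsed index file."""
    key = 'Filings%s' % year
    if key not in payload:
        raise KeyError('Expected key %r, found %s' % (key, sorted(payload)))
    return payload[key]

# Tiny hand-made stand-in for an index file (not real IRS data)
sample_text = '{"Filings2021": [{"EIN": "000000000", "FormType": "990"}]}'
sample = json.loads(sample_text)

filings = extract_filings(sample, 2021)
print(len(filings))
```

In the loop above this would be called as `extract_filings(f.json(), year)`. Calling `f.raise_for_status()` before `f.json()` is another option if you want HTTP errors to surface immediately.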


<br>We can now print out the number of filings that have been successfully downloaded for each year.

In [34]:
for year in year_list:
    print('# of filings in', year, ':', len(data['Filings%s' % year]))

# of filings in 2011 : 203075
# of filings in 2012 : 261622
# of filings in 2013 : 261449
# of filings in 2014 : 387529
# of filings in 2015 : 261034
# of filings in 2016 : 378420
# of filings in 2017 : 489013
# of filings in 2018 : 457510
# of filings in 2019 : 416910
# of filings in 2020 : 333722
# of filings in 2021 : 461887


<br>To view one entry in the 2021 index we run the following code block. You can see here that each entry in the index file contains minimal information on each filing. What we need these indexes for is really only the *URL* value; in the next Jupyter notebook we will access the full 990 filings using those *URL* addresses.

In [35]:
data['Filings2021'][:1]

[{'EIN': '836009168',
  'TaxPeriod': '202006',
  'DLN': '93492014007381',
  'FormType': '990EZ',
  'URL': 'https://s3.amazonaws.com/irs-form-990/202130149349200738_public.xml',
  'OrganizationName': 'TORRINGTON ROTARY CLUB',
  'SubmittedOn': '2021-04-22',
  'ObjectId': '202130149349200738',
  'LastUpdated': '2021-06-11T13:10:10'}]
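Since the *URL* value is what the later steps need, here is a minimal sketch of pulling the URLs out of a list of index entries, optionally restricted to certain form types. The helper name `filing_urls` is hypothetical; the single entry is copied from the output above.

```python
def filing_urls(index, form_types=('990', '990EZ', '990PF')):
    """Return the URL of every index entry whose FormType is in `form_types`."""
    return [entry['URL'] for entry in index if entry['FormType'] in form_types]

# One index entry, copied from the 2021 output above
index = [{'EIN': '836009168',
          'TaxPeriod': '202006',
          'DLN': '93492014007381',
          'FormType': '990EZ',
          'URL': 'https://s3.amazonaws.com/irs-form-990/202130149349200738_public.xml',
          'OrganizationName': 'TORRINGTON ROTARY CLUB',
          'SubmittedOn': '2021-04-22',
          'ObjectId': '202130149349200738',
          'LastUpdated': '2021-06-11T13:10:10'}]

print(filing_urls(index))            # all form types
print(filing_urls(index, ('990',)))  # full Form 990 only; this entry is a 990EZ
```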

<br>We can also run the following code block to count the total number of filings indexed: 3,912,171. This includes non-501(c)(3) organizations and covers *990EZ*, *990PF*, and *990* filings. 

In [51]:
sum(len(v) for v in data.values())

3912171

### Read JSON file into MongoDB database
We will now insert our data into the MongoDB database. Because the data are nested under the 11 annual dictionary keys, we will loop over each of those keys and insert each year individually. The <code>insert_many</code> method takes care of the data insertion.

In [17]:
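# NOTE: file_list_2011_2019 is not defined anywhere in this notebook (the collections
# above are per-year), so this cell raises a NameError if run as-is. It is kept
# commented out; the next cell performs the insertion per year.
# for year in year_list:
#     print('# of filings in database:', file_list_2011_2019.estimated_document_count())
#     file_list_2011_2019.insert_many(data['Filings%s' % year])
#     print('# of filings in', year, 'added to database:', len(data['Filings%s' % year]), '\n')
# print('Total # of filings in database:', file_list_2011_2019.estimated_document_count() )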

In [37]:
counts = []
for year in year_list:
    # Look up this year's collection by name (same object as file_list_<year> defined earlier)
    col = db['file_list_%s' % year]
    print('# of filings in %s database:' % year, col.estimated_document_count())
    col.insert_many(data['Filings%s' % year])
    # Computed after the insert, so this difference is 0 when every record was added
    print('# of filings in', year, 'to be added to database:', len(data['Filings%s' % year]) - col.estimated_document_count())
    counts.append(col.estimated_document_count())
    print('# of filings in', year, 'added to database:', col.estimated_document_count(), '\n')
counts

# of filings in 2011 database: 0
# of filings in 2011 to be added to database: 0
# of filings in 2011 added to database: 203075 

# of filings in 2012 database: 0
# of filings in 2012 to be added to database: 0
# of filings in 2012 added to database: 261622 

# of filings in 2013 database: 0
# of filings in 2013 to be added to database: 0
# of filings in 2013 added to database: 261449 

# of filings in 2014 database: 0
# of filings in 2014 to be added to database: 0
# of filings in 2014 added to database: 387529 

# of filings in 2015 database: 0
# of filings in 2015 to be added to database: 0
# of filings in 2015 added to database: 261034 

# of filings in 2016 database: 0
# of filings in 2016 to be added to database: 0
# of filings in 2016 added to database: 378420 

# of filings in 2017 database: 0
# of filings in 2017 to be added to database: 0
# of filings in 2017 added to database: 489013 

# of filings in 2018 database: 0
# of filings in 2018 to be added to database: 0
# of fili

[203075,
 261622,
 261449,
 387529,
 261034,
 378420,
 489013,
 457510,
 416910,
 333722,
 461887]
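Each yearly index holds several hundred thousand records, and the insertion cell is not meant to be run twice on a populated collection. The following is a minimal sketch, under those assumptions, of inserting in batches and skipping a collection that already contains documents. The helpers `chunked` and `load_if_empty` and the `StubCollection` class are hypothetical; the stub only mimics the two pymongo collection methods used in this notebook (`estimated_document_count` and `insert_many`) so the example runs without a MongoDB server.

```python
def chunked(records, size):
    """Yield successive slices of `records` with at most `size` items each."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def load_if_empty(collection, records, batch_size=50000):
    """Insert `records` in batches unless the collection already has documents."""
    if collection.estimated_document_count() > 0:
        return 0  # already populated: insert nothing
    inserted = 0
    for batch in chunked(records, batch_size):
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted

class StubCollection:
    """In-memory stand-in for a pymongo collection, for demonstration only."""
    def __init__(self):
        self.docs = []
    def estimated_document_count(self):
        return len(self.docs)
    def insert_many(self, documents):
        self.docs.extend(documents)

stub = StubCollection()
records = [{'n': i} for i in range(5)]
first = load_if_empty(stub, records, batch_size=2)   # inserts all 5 in 3 batches
second = load_if_empty(stub, records, batch_size=2)  # skipped: collection not empty
print(first, second)
```

With a real database, the call would look like `load_if_empty(db['file_list_2011'], data['Filings2011'])`.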


##### Count all of the files in the individual datasets

In [50]:
print(file_list_2011.estimated_document_count())
print(file_list_2012.estimated_document_count())
print(file_list_2013.estimated_document_count())
print(file_list_2014.estimated_document_count())
print(file_list_2015.estimated_document_count())
print(file_list_2016.estimated_document_count())
print(file_list_2017.estimated_document_count())
print(file_list_2018.estimated_document_count())
print(file_list_2019.estimated_document_count())
print(file_list_2020.estimated_document_count())
print(file_list_2021.estimated_document_count())

203075
261622
261449
387529
261034
378420
489013
457510
416910
333722
461887


<br>Here is an alternative approach. First, create a list of strings, one per collection:

In [39]:
file_list = []
for year in year_list:
    file_list.append('file_list_%s.estimated_document_count()' % year)
print(file_list)

['file_list_2011.estimated_document_count()', 'file_list_2012.estimated_document_count()', 'file_list_2013.estimated_document_count()', 'file_list_2014.estimated_document_count()', 'file_list_2015.estimated_document_count()', 'file_list_2016.estimated_document_count()', 'file_list_2017.estimated_document_count()', 'file_list_2018.estimated_document_count()', 'file_list_2019.estimated_document_count()', 'file_list_2020.estimated_document_count()', 'file_list_2021.estimated_document_count()']


<br>Then create a dictionary:

In [40]:
db_dict = dict(zip(year_list,file_list))
db_dict

{2011: 'file_list_2011.estimated_document_count()',
 2012: 'file_list_2012.estimated_document_count()',
 2013: 'file_list_2013.estimated_document_count()',
 2014: 'file_list_2014.estimated_document_count()',
 2015: 'file_list_2015.estimated_document_count()',
 2016: 'file_list_2016.estimated_document_count()',
 2017: 'file_list_2017.estimated_document_count()',
 2018: 'file_list_2018.estimated_document_count()',
 2019: 'file_list_2019.estimated_document_count()',
 2020: 'file_list_2020.estimated_document_count()',
 2021: 'file_list_2021.estimated_document_count()'}

<br>And finally, loop over the dictionary. The problem is that each dictionary value is a string for the *estimated_document_count( )* command; I don't then have one for the *insert_many( )* command. So, I am instead building the collection name inside the loop, as in the insertion cell above.

In [44]:
for year in year_list:
    col = 'file_list_%s' % year
    # Look up the collection by name and print its document count
    print(col, '\t', db[col].estimated_document_count())

file_list_2011 	 203075
file_list_2012 	 261622
file_list_2013 	 261449
file_list_2014 	 387529
file_list_2015 	 261034
file_list_2016 	 378420
file_list_2017 	 489013
file_list_2018 	 457510
file_list_2019 	 416910
file_list_2020 	 333722
file_list_2021 	 461887


<br>We can inspect the data by checking the first two filings in the database:

In [47]:
for filing in file_list_2011.find()[:2]:
    print(filing, '\n')

{'_id': ObjectId('62af860e55dbc5a4cb59b27d'), 'EIN': '591971002', 'TaxPeriod': '201009', 'DLN': '93493316003251', 'FormType': '990', 'URL': 'https://s3.amazonaws.com/irs-form-990/201103169349300325_public.xml', 'OrganizationName': 'ANGELUS INC', 'SubmittedOn': '2011-11-30', 'ObjectId': '201103169349300325', 'LastUpdated': '2016-03-21T17:23:53'} 

{'_id': ObjectId('62af860e55dbc5a4cb59b27e'), 'EIN': '251713602', 'TaxPeriod': '201106', 'DLN': '93493313012311', 'FormType': '990', 'URL': 'https://s3.amazonaws.com/irs-form-990/201113139349301231_public.xml', 'OrganizationName': 'TOUCH-STONE SOLUTIONS INC', 'SubmittedOn': '2011-11-30', 'ObjectId': '201113139349301231', 'LastUpdated': '2016-03-21T17:23:53'} 



Let's also get the frequency counts for the various *FormType*s. We see the frequencies for 990PF, 990EZ, and 990 filings for the 2021 index collection.

In [48]:
# Group the documents by FormType and count how many fall in each group
pipeline = [ {"$group": {"_id": "$FormType", "count": {"$sum": 1}}} ]
list(file_list_2021.aggregate(pipeline))

[{'_id': '990EZ', 'count': 129184},
 {'_id': '990', 'count': 253933},
 {'_id': '990PF', 'count': 78770}]
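For readers new to MongoDB aggregation, the `$group` stage with `{"$sum": 1}` is a grouped count. As a rough illustration only, the same tally can be written in plain Python with `collections.Counter`. The sample records below are hand-made stand-ins, not real IRS data, and this in-memory version is not how the notebook computes the counts.

```python
from collections import Counter

# Hand-made stand-in records (not real IRS data)
sample = [{'FormType': '990'},
          {'FormType': '990EZ'},
          {'FormType': '990'},
          {'FormType': '990PF'}]

# Count how many records share each FormType value
form_counts = Counter(doc['FormType'] for doc in sample)

# Reshape to mirror the structure of the aggregation output
as_pipeline_output = [{'_id': form, 'count': n} for form, n in form_counts.items()]
print(as_pipeline_output)
```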

Loop over all of the individual year collections in the database:

In [49]:
for year in year_list:
    col = 'file_list_%s' % year
    # Look up the collection by name and run the same FormType pipeline
    print(list(db[col].aggregate(pipeline)))

[{'_id': '990EZ', 'count': 65858}, {'_id': '990PF', 'count': 24199}, {'_id': '990', 'count': 113018}]
[{'_id': '990EZ', 'count': 79633}, {'_id': '990PF', 'count': 33822}, {'_id': '990', 'count': 148167}]
[{'_id': '990EZ', 'count': 82475}, {'_id': '990PF', 'count': 25414}, {'_id': '990', 'count': 153560}]
[{'_id': '990EZ', 'count': 112403}, {'_id': '990PF', 'count': 59255}, {'_id': '990', 'count': 215871}]
[{'_id': '990EZ', 'count': 81244}, {'_id': '990PF', 'count': 38650}, {'_id': '990', 'count': 141140}]
[{'_id': '990', 'count': 211537}, {'_id': '990PF', 'count': 53694}, {'_id': '990EZ', 'count': 113189}]
[{'_id': '990EZ', 'count': 145966}, {'_id': '990PF', 'count': 68547}, {'_id': '990', 'count': 274500}]
[{'_id': '990EZ', 'count': 136384}, {'_id': '990PF', 'count': 67632}, {'_id': '990', 'count': 253494}]
[{'_id': '990EZ', 'count': 124815}, {'_id': '990PF', 'count': 64720}, {'_id': '990', 'count': 227375}]
[{'_id': '990EZ', 'count': 105844}, {'_id': '990PF', 'count': 27932}, {'_id':

<br>Note that at this point we only have a database containing an index of the 3,912,171 filings. To actually use the 990 data, we will first have to download the filings listed in this database and create a data dictionary that maps out the variables we're interested in. The next set of tutorials will cover those steps.